A new readability measure for web documents and its evaluation on an effective web search engine

Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services. Pub Date: 2016-11-28. DOI: 10.1145/3011141.3011172

Yume Sasaki, Takuya Komatsuda, Atsushi Keyaki, Jun Miyazaki


Citations: 2

Abstract

In this study, we propose a readability measure for Web documents and an information retrieval system that considers readability. Previous information retrieval systems aim to identify documents that are relevant to a given query; however, as the information requirements of search system users become increasingly diverse and complicated, systems that take criteria beyond relevance into account are constantly being introduced. In particular, the focus of the present paper is on readability. Given that the population of non-native English speakers exceeds that of native English speakers, incorporating readability into an information retrieval system is crucial. Therefore, we propose (1) a readability measure that considers document simplicity and document structure as new features for readability and (2) a score fusion method that combines relevance and readability scores. In our experiments, we found that our proposed readability measure outperformed an existing readability measure. Moreover, we found that score fusion methods using a statistical framework called a copula improved overall accuracy compared with existing methods such as linear combination.
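The two families of score fusion named in the abstract can be illustrated with a minimal sketch. The abstract does not state which copula family, normalization, or weights were used, so everything specific here is an assumption made for illustration: the Clayton copula, the rank-based empirical CDF, min-max normalization, the parameters `alpha` and `theta`, and all function names are hypothetical and are not taken from the paper.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def linear_fusion(relevance: List[float], readability: List[float],
                  alpha: float = 0.7) -> List[float]:
    """Weighted sum of normalized scores (alpha is an assumed weight)."""
    rel, read = min_max(relevance), min_max(readability)
    return [alpha * r + (1 - alpha) * d for r, d in zip(rel, read)]


def empirical_cdf(scores: List[float]) -> List[float]:
    """Map each score to (number of scores <= it) / (n + 1), inside (0, 1)."""
    n = len(scores)
    return [sum(1 for t in scores if t <= s) / (n + 1) for s in scores]


def clayton_fusion(relevance: List[float], readability: List[float],
                   theta: float = 2.0) -> List[float]:
    """Fuse scores with the Clayton copula CDF (assumed family, theta > 0).

    C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta)
    """
    u, v = empirical_cdf(relevance), empirical_cdf(readability)
    return [(a ** -theta + b ** -theta - 1) ** (-1 / theta)
            for a, b in zip(u, v)]


if __name__ == "__main__":
    # Made-up scores for three documents, for illustration only.
    relevance = [0.9, 0.8, 0.3]
    readability = [0.2, 0.9, 0.7]
    print(linear_fusion(relevance, readability))
    print(clayton_fusion(relevance, readability))
```

In this toy input the second document is both fairly relevant and highly readable, so both fusion functions rank it first. The copula variant works on the marginal ranks of each score rather than their raw magnitudes, which is one reason such a framework can behave differently from a weighted sum.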

