Measuring the Effects of Summarization in Cluster-based Information Retrieval

2020 39th International Conference of the Chilean Computer Science Society (SCCC). Pub Date: 2020-11-16. DOI: 10.1109/SCCC51225.2020.9281189

Arturo Curiel, Claudio Gutiérrez-Soto, Pablo-Nicolas Soto-Borquez, Patricio Galdames


Citations: 1

Abstract


Summarization is an integral part of the modern Internet. In social networks, which have become primary information sources, users have grown accustomed to condensing their writing. Content providers routinely publish short textual excerpts to these platforms as well. However, with ever larger quantities of small documents becoming available, search engines now have less data with which to index, classify and retrieve relevant information. In this regard, more research is needed to show how reliable current Information Retrieval (IR) algorithms are when confronted with collections of exclusively short documents, such as the ones arising from social media.

This paper explores the semantic proximity between human summaries and queries through cluster analysis, and how it relates to IR. Roughly, the k-means algorithm was used to cluster two collections of summaries, one in English and one in Spanish, by their semantic similarity, in order to measure how summarization may affect information content in cluster-based IR. Furthermore, the same algorithm was used to measure how documents grouped around a set of artificially generated queries.

The results show that, regardless of the language, providing the algorithm with prior category knowledge may help increase the accuracy of cluster-based document classification. Furthermore, some evidence points to an effect of summary quality on retrievability: summaries created by specialized summarizers induced more distinguishable clusters than summaries created by university students. Future work in this area may serve to adapt existing algorithms to big collections of short documents, improving IR performance in cases where machine learning techniques are not available.
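To make the clustering setup described in the abstract more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how summaries were vectorized, which similarity measure was used, or how category knowledge was supplied. The sketch assumes plain term-frequency vectors, cosine similarity (a spherical variant of k-means), and centroids seeded from short query-like strings standing in for category knowledge. The toy summaries, the seed strings, and all function names are hypothetical.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tf_vector(text):
    """Unit-length term-frequency vector (assumed representation)."""
    return normalise(Counter(text.lower().split()))


def dot(a, b):
    """Dot product of two sparse vectors; equals cosine for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Unit-length mean direction of a group of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalise(total)


def kmeans(vectors, seeds, max_iter=20):
    """Spherical k-means started from the given seed centroids."""
    centroids = [dict(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign each document to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: dot(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels, centroids


# Hypothetical toy summaries: two about sport, two about finance.
summaries = [
    "the team won the football match",
    "the striker scored in the football final",
    "the central bank raised interest rates",
    "inflation and interest rates worry the bank",
]
vectors = [tf_vector(s) for s in summaries]

# Assumption: category knowledge is injected by seeding the centroids
# with query-like strings, one per expected category.
seeds = [tf_vector("football match"), tf_vector("bank interest rates")]

labels, centroids = kmeans(vectors, seeds)
print(labels)
```

With these toy inputs the two sport summaries should end up in one cluster and the two finance summaries in the other. Replacing the seeds with arbitrarily chosen documents gives a rough way to compare clustering with and without category knowledge, which is the kind of contrast the abstract reports.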
