Measuring the Effects of Summarization in Cluster-based Information Retrieval

Arturo Curiel, Claudio Gutiérrez-Soto, Pablo-Nicolas Soto-Borquez, Patricio Galdames
Published in: 2020 39th International Conference of the Chilean Computer Science Society (SCCC)
Publication date: 2020-11-16
DOI: 10.1109/SCCC51225.2020.9281189 (https://doi.org/10.1109/SCCC51225.2020.9281189)
Citations: 1

Abstract

Summarization is an integral part of the modern Internet. In social networks, which have become primary information sources, users have grown accustomed to condensing their writing. Content providers routinely publish short textual excerpts to these platforms as well. However, as ever larger quantities of small documents become available, search engines have less data per document with which to index, classify and retrieve relevant information. In this regard, more research is needed to show how reliable current Information Retrieval (IR) algorithms are when confronted with collections of exclusively short documents, such as those arising from social media.

This paper explores the semantic proximity between human summaries and queries through cluster analysis, and how it relates to IR. Roughly, the k-means algorithm was used to cluster two collections of summaries by their semantic similarity: one in English and one in Spanish. The aim was to measure how summarization may affect information content in cluster-based IR. Furthermore, the same algorithm was used to measure how documents grouped around a set of artificially generated queries.

The results show that, regardless of the language, providing the algorithm with previous category knowledge may contribute to increasing the accuracy of cluster-based document classification. Furthermore, some evidence points to an effect of summary quality on retrievability: summaries created by specialized summarizers induced more distinguishable clusters than summaries created by university students. Future work in this area may serve to adapt existing algorithms to big collections of short documents, improving IR performance in cases where machine learning techniques are not available.
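
The clustering idea in the abstract can be illustrated with a small sketch. The abstract does not say how the summaries were vectorized or how category knowledge was supplied to k-means, so everything below is an assumption made for illustration: documents are represented as normalized bag-of-words vectors, similarity is the cosine, and prior category knowledge is modeled by initializing each centroid from a short seed text. The function names (`seeded_kmeans`, `normalise`, `dot`, `centroid`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def dot(u, v):
    """Dot product of two sparse vectors; equals cosine for unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def centroid(vectors):
    """Unit-length mean direction of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return normalise(total)


def seeded_kmeans(docs, seeds, max_iter=20):
    """Cosine-based k-means whose centroids start from labelled seed texts.

    ASSUMPTION: `seeds` (label -> seed text) stands in for the
    "previous category knowledge" mentioned in the abstract.
    """
    vectors = [normalise(Counter(d.lower().split())) for d in docs]
    labels = list(seeds)
    centroids = [normalise(Counter(seeds[l].lower().split())) for l in labels]
    assignment = None
    for _ in range(max_iter):
        # Assign every document to its most similar centroid.
        new_assignment = [
            max(range(len(centroids)), key=lambda k: dot(vec, centroids[k]))
            for vec in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [vectors[i] for i, a in enumerate(assignment) if a == k]
            if members:
                centroids[k] = centroid(members)
    return [labels[a] for a in assignment]


# Toy short documents and seed texts (invented for this example).
docs = [
    "team wins football match",
    "football coach praises team",
    "new vaccine trial results",
    "vaccine study shows results",
]
seeds = {"sports": "football team", "health": "vaccine study"}
result = seeded_kmeans(docs, seeds)
print(result)
```

On this toy data the first two documents fall under the "sports" seed and the last two under "health". The seed texts could equally be queries, which gives a rough picture of how documents might group around a set of queries; with very short documents, a seed that shares no terms with a document gives zero similarity, which is one way the data sparsity described in the abstract shows up.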