Interactive clustering and high-recall information retrieval using language models

Sima Rezaeipourfarsangi, Ningyuan Pei, Ehsan Sherkat, E. Milios
DOI: 10.1145/3531073.3531174 (https://doi.org/10.1145/3531073.3531174)
Published in: Proceedings of the 2022 International Conference on Advanced Visual Interfaces
Publication date: 2022-06-06
Citations: 2

Abstract

Clustering is a crucial text mining technique for organizing digital document sets, enabling users to understand their data better. It has been demonstrated that involving users can often significantly improve clustering quality. We propose a novel system that combines deep language models (SBERT, InferSent, and Universal Sentence Encoder) with interactive clustering, enabling users to steer the clustering algorithm towards results meaningful to them through interactive document and cluster visualizations. Our system comprises several visual components, each of which allows the user to apply their domain knowledge to the clustering process. The use of deep language models for representing sentences addresses the vocabulary mismatch problem that affects bag-of-words representations of documents. We use sentence embeddings to obtain document embeddings, which serve as input to the clustering algorithm, a modified version of K-means. We conduct a two-stage evaluation of our system. First, we evaluate the proposed clustering models on automatic clustering of various publicly available data sets and confirm that they are competitive with the state of the art. Second, we conduct a formal expert study on a specific data set consisting of our research group’s readings (research papers in machine learning, text mining, and natural language processing) over several years. The domain expert is a graduate student whose thesis is in these fields. The expert study concludes that our system is significantly better at producing meaningful clusters than the baseline system (Vis-Kt).
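The step from sentence embeddings to document embeddings to clusters can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: sentence vectors are combined by simple averaging, the clustering is plain Lloyd's K-means rather than the authors' modified version, and the 2-D vectors are made-up toy values standing in for the output of a real sentence encoder. The names `mean_pool` and `kmeans` are hypothetical helpers, not part of the described system.

```python
import math
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed pooling strategy)."""
    dim = len(vectors[0])
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's K-means; NOT the modified variant used in the paper."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_pool(members)
    return labels, centroids


# Toy data: four documents, each a list of 2-D "sentence embeddings".
# These numbers are invented for illustration only.
docs = [
    [[0.9, 0.1], [1.0, 0.0]],
    [[0.8, 0.2], [0.9, 0.2]],
    [[0.1, 0.9], [0.0, 1.0]],
    [[0.2, 0.8], [0.1, 1.0]],
]

# One vector per document, obtained by averaging its sentence vectors.
doc_vectors = [mean_pool(sentences) for sentences in docs]
labels, centroids = kmeans(doc_vectors, k=2)
print(labels)
```

On this toy input the first two documents end up in one cluster and the last two in the other; which cluster receives index 0 depends on the random initialisation. In an interactive setting such as the one the abstract describes, user feedback would steer this process, which the sketch does not attempt to model.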