Inter-document reference detection as an alternative to full text semantic analysis in document clustering

P. D. Mazière, M. Hulle
Published in: IEEE International Workshop on Machine Learning for Signal Processing (proceedings), 2013. DOI: https://doi.org/10.1109/MLSP.2013.6661952

Abstract

We discuss the search for inter-document references as an alternative to grouping document inventories on the basis of a full-text semantic analysis. The document inventory used, which is not publicly available, was provided to us by the European Union (EU) in the framework of an EU project whose aim was to analyse, classify, and visualise EU-funded research in the social sciences and humanities in the EU framework programmes FP5 and FP6. This project, called the SSH project for short, was aimed at evaluating the contributions of research to the development of EU policies. For the semantics-based grouping, we start from a Multi-Dimensional Scaling analysis of the document vectors, which is the result of a prior semantic analysis. As an alternative to a semantic analysis, we searched for inter-document references, or direct references. Direct references are defined as terms that explicitly refer to other documents present in the inventory. We show that the grouping based on references is largely similar to the one based on semantics, but requires considerably less computational effort. In addition, non-experts can make better use of the results, since the references are displayed as graphical webpages with hyperlinks pointing to both the referenced and the referencing document(s), together with the reason for the linkage. Finally, we show that the combination of a database, to store the data and the (intermediate) results, and a webserver, to visualise the results, offers a powerful platform for analysing the document inventory and for sharing the results with all participants and collaborators involved in a data- and computation-intensive EU project, thereby guaranteeing both data and result consistency.
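To make the notion of a direct reference concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes each document has a list of identifying terms (here hypothetical project acronyms such as PROJ-ALPHA), scans every other document's text for those terms, and keeps the matched term as the reason for the linkage. The grouping step uses connected components over the reference links, which is an assumption made for illustration; the abstract does not say how the reference-based groups were formed. All names and data below are invented.

```python
import re
from collections import defaultdict


def find_direct_references(inventory, identifiers):
    """Return (referencing_doc, referenced_doc, matched_term) triples.

    inventory:   dict mapping document id -> full text
    identifiers: dict mapping document id -> terms that identify that document
    """
    links = []
    for target, terms in identifiers.items():
        for term in terms:
            # Whole-word, case-insensitive match of the identifying term.
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            for source, text in inventory.items():
                # A document mentioning its own identifier is not a reference.
                if source != target and pattern.search(text):
                    links.append((source, target, term))
    return links


def group_by_references(doc_ids, links):
    """Group documents into connected components of the reference graph.

    This grouping rule is an illustrative assumption (union-find over links).
    """
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for source, target, _term in links:
        parent[find(source)] = find(target)

    groups = defaultdict(set)
    for d in doc_ids:
        groups[find(d)].add(d)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy inventory; the real SSH inventory is not publicly available.
inventory = {
    "doc_a": "Final report of PROJ-ALPHA on civic participation.",
    "doc_b": "This study extends the survey design of PROJ-ALPHA.",
    "doc_c": "PROJ-GAMMA examines labour migration.",
}
identifiers = {
    "doc_a": ["PROJ-ALPHA"],
    "doc_b": ["PROJ-BETA"],
    "doc_c": ["PROJ-GAMMA"],
}

links = find_direct_references(inventory, identifiers)
groups = group_by_references(list(inventory), links)
print(links)
print(groups)
```

Because each link carries the matched term, a results page could show both the referencing and the referenced document along with the term that connects them. The matching is a plain text scan per identifying term, with no document vectors or distance matrix involved.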