Clustering experiments with the Astro benchmarking data set with semantic document embeddings – off-the-shelf vs. custom embeddings created from citations, text, and both
Author: Paul Donner
DOI: 10.55835/643fed628e529cfebf33f797
Published: 2023-05-19 in 27th International Conference on Science, Technology and Innovation Indicators (STI 2023)
Citations: 0
Abstract
What accounts for the observed higher quality of publication-level topical science clustering solutions that use only citation relations as input, compared to those using sophisticated semantic similarity data derived from both citations and textual terms? A survey of empirical work on unconscientious referencing practices indicates that purely citation-based methods should be affected by significant ‘citation noise’, unlike text-based methods. This study continues work with the Astro benchmarking data set for bibliometric clustering, applying semantic representation learning techniques to scientific documents in order to isolate the clustering performance difference between direct citations and textual terms. We investigate variants of Random Indexing embeddings learned on this data set and one pre-trained, off-the-shelf semantic document embedding model, SPECTER. The evaluation is performed with four previously introduced validation data sets, using a newly suggested clustering evaluation measure.
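To illustrate the Random Indexing approach mentioned in the abstract, the following is a minimal sketch of the core idea: each feature (a textual term, a cited-document identifier, or both kinds mixed) gets a fixed sparse ternary random index vector, and a document's embedding is the sum of the index vectors of its features. The function names, dimensionality, and sparsity settings here are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np

def random_index_vector(dim=512, nnz=10, rng=None):
    """Sparse ternary random index vector: nnz nonzero entries,
    roughly half set to +1 and half to -1, rest zero."""
    rng = rng or np.random.default_rng()
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nnz, replace=False)
    signs = rng.permutation([1] * (nnz // 2) + [-1] * (nnz - nnz // 2))
    v[positions] = signs
    return v

def embed_documents(docs, dim=512, nnz=10, seed=0):
    """Embed each document (a list of feature strings, e.g. terms
    and/or cited-document IDs) as the sum of its features'
    random index vectors. Returns an (n_docs, dim) array."""
    rng = np.random.default_rng(seed)
    index_vectors = {}  # feature -> its fixed random index vector
    embeddings = []
    for features in docs:
        e = np.zeros(dim)
        for f in features:
            if f not in index_vectors:
                index_vectors[f] = random_index_vector(dim, nnz, rng)
            e += index_vectors[f]
        embeddings.append(e)
    return np.vstack(embeddings)
```

Because high-dimensional sparse random vectors are nearly orthogonal, documents sharing many features end up with higher cosine similarity than unrelated ones, which makes such embeddings usable as input to a standard clustering algorithm.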