Authors: Sha Yuan, Zhou Shao
Journal: Future Generation Computer Systems-The International Journal of Escience, vol. 175, Article 108117 (JCR Q1, Computer Science, Theory & Methods; impact factor 6.2)
Published: 2025-09-06
DOI: 10.1016/j.future.2025.108117
URL: https://www.sciencedirect.com/science/article/pii/S0167739X2500411X
An efficient two-stage computing method for large-scale research interest mining
Semantic analysis of academic data underpins many scientific services, such as review recommendation and the planning of research funding directions. Research interest analysis is challenging in large-scale academic data mining. Traditional ways of representing research interests, such as manual labeling or statistical and machine learning methods, have limitations; in particular, their computational cost becomes unacceptable when integrating large-scale, multisource information. This paper presents an efficient computing method for predicting scholars' interests, based on the principle of large-scale recommendation systems and consisting of two stages: rough sorting and refined sorting. In rough sorting, one-hot encoding, chi-square feature selection, TF-IDF feature extraction, and an SGD-based classifier are used to obtain several top candidate interest labels. In refined sorting, a pre-trained SciBERT model outputs the optimal interest labels. The proposed approach offers two main advantages. First, it improves computational efficiency, since applying pre-trained models such as BERT directly to large-scale data leads to excessive computation. Second, it yields better model performance: feature selection in the rough sorting stage avoids the negative impact of irrelevant papers on prediction precision, which is a problem when a pre-trained model is used directly.
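The rough-then-refined structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, swaps in a nearest-centroid cosine scorer over TF-IDF vectors where the paper uses an SGD-based classifier, omits one-hot encoding and chi-square feature selection, and represents SciBERT with an arbitrary `expensive_score` callable. All function names, the toy corpus, and the parameter `k` are hypothetical. The point it shows is only that the costly scorer runs on `k` candidate labels rather than on every label.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def fit_rough(docs, labels):
    """Cheap stage training: IDF weights plus one TF-IDF centroid per label."""
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(tokenize(d)))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    centroids = defaultdict(Counter)
    for d, lab in zip(docs, labels):
        for tok, tf in Counter(tokenize(d)).items():
            centroids[lab][tok] += tf * idf[tok]
    return idf, dict(centroids)

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rough_sort(text, idf, centroids, k):
    """Rank all labels with the cheap scorer and keep the top k candidates."""
    counts = Counter(tokenize(text))
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    ranked = sorted(centroids, key=lambda lab: cosine(vec, centroids[lab]),
                    reverse=True)
    return ranked[:k]

def refined_sort(text, candidates, expensive_score):
    """Apply the costly scorer (a stand-in for SciBERT) to the candidates only."""
    return max(candidates, key=lambda lab: expensive_score(text, lab))

def predict_interest(text, idf, centroids, expensive_score, k=2):
    candidates = rough_sort(text, idf, centroids, k)
    return refined_sort(text, candidates, expensive_score), candidates

def make_counting_scorer():
    """Toy scorer (assumption): token overlap with the label name; logs each call."""
    calls = []
    def score(text, label):
        calls.append(label)
        return len(set(tokenize(text)) & set(tokenize(label)))
    return score, calls

# Hypothetical toy corpus: paper snippets paired with interest labels.
DOCS = [
    "graph neural network node embedding",
    "graph clustering community detection",
    "image segmentation convolutional network",
    "object detection image convolutional",
    "query optimization database index",
    "transaction database concurrency control",
]
LABELS = ["graph mining", "graph mining", "computer vision",
          "computer vision", "databases", "databases"]

if __name__ == "__main__":
    idf, centroids = fit_rough(DOCS, LABELS)
    score, calls = make_counting_scorer()
    query = "convolutional network for computer vision image tasks"
    best, candidates = predict_interest(query, idf, centroids, score, k=2)
    print(best, candidates, len(calls))
```

In this sketch the expensive scorer is called once per surviving candidate, so its cost scales with `k` instead of with the full label set; that is the efficiency argument the abstract makes for placing a cheap filter in front of a pre-trained model.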
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.