Creating Recommender Systems Datasets in Scientific Fields

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. Pub Date: 2021-08-14. DOI: 10.1145/3447548.3470805

Márcia Barros, Francisco M. Couto, Matilde Pato, Pedro Ruas


Citations: 0

Abstract


Creating Recommender Systems Datasets in Scientific Fields

Recommender systems (RS) have been successfully explored in a vast number of domains, e.g. movies and TV shows, music, or e-commerce. In these domains a large number of datasets are freely available for testing and evaluating new recommender algorithms: for example, the MovieLens and Netflix datasets for movies, Spotify for music, and Amazon for e-commerce. This availability translates into a large number of algorithms applied to these fields. In scientific fields, such as Health and Chemistry, standard, open-access datasets with information about user preferences are scarce. Building one raises three questions. First, what is the application domain, i.e. "what the recommended item is"? Second, who are the end users: researchers, pharmacists, clinicians, or policy makers? Third, what data is available? Thus, if we wish to develop an algorithm for recommending scientific items, we do not have access to datasets with information about the past preferences of a group of users.
Given this limitation, we developed a methodology called LIBRETTI (LIterature Based RecommEndaTion of scienTific Items), whose goal is the creation of datasets related to scientific fields. These datasets are built from the main knowledge resource that science has: scientific literature. We consider the users to be the authors of the publications, the items to be the scientific entities (for example, chemical compounds or diseases), and the ratings to be the number of publications an author wrote about an entity. In this tutorial we will survey state-of-the-art recommender systems in scientific fields, explain what Named Entity Recognition/Linking (NER/NEL) is in the context of research literature, and demonstrate how to create a dataset for recommending drugs and diseases from research literature related to COVID-19. Our goal is to spread the use of the LIBRETTI methodology in order to help the development of recommender algorithms in scientific fields. More information about the tutorial is available at https://lasigebiotm.github.io/RecSys.Scifi/.
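The user/item/rating mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `publications` records, the author names, and the `build_ratings` helper are hypothetical, and the entities are assumed to have already been extracted from each paper by a NER/NEL step that is not shown here.

```python
from collections import Counter

# Hypothetical toy corpus: each record lists a paper's authors and the
# scientific entities assumed to have been recognized in it (NER/NEL not shown).
publications = [
    {"authors": ["A. Silva", "B. Costa"], "entities": ["remdesivir", "COVID-19"]},
    {"authors": ["A. Silva"], "entities": ["remdesivir", "remdesivir", "pneumonia"]},
    {"authors": ["B. Costa"], "entities": ["COVID-19"]},
]


def build_ratings(pubs):
    """Count, for each (author, entity) pair, the publications linking them.

    Users are authors, items are entities, and the rating is the number of
    publications in which the author wrote about the entity.
    """
    ratings = Counter()
    for pub in pubs:
        # Deduplicate within a paper so repeated mentions count only once.
        for author in set(pub["authors"]):
            for entity in set(pub["entities"]):
                ratings[(author, entity)] += 1
    return ratings


ratings = build_ratings(publications)

# Emit <user, item, rating> triples in a stable order.
for (author, entity), count in sorted(ratings.items()):
    print(f"{author}\t{entity}\t{count}")
```

Counting each entity once per paper is a design choice made for this sketch so that the rating matches "number of publications" rather than number of mentions; the resulting triples have the same shape as the implicit-feedback datasets used in other recommender domains.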

