Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary
arXiv - CS - Digital Libraries, 2024-07-30
Identifier: arxiv-2407.20595
Harvesting Textual and Structured Data from the HAL Publication Repository
HAL (Hyper Articles en Ligne) is the French national publication repository,
used by most higher education and research organizations for their open science
policy. As a digital library, it is a rich repository of scholarly documents,
but its potential for advanced research has been underutilized. We present
HALvest, a unique dataset that bridges the gap between citation networks and
the full text of papers submitted to HAL. We craft our dataset by filtering HAL
for scholarly publications, resulting in approximately 700,000 documents,
spanning 34 languages across 13 identified domains, suitable for language model
training, and yielding approximately 16.5 billion tokens (with 8 billion in
French and 7 billion in English, the most represented languages). We transform
the metadata of each paper into a citation network, producing a directed
heterogeneous graph. This graph includes uniquely identified authors on HAL, as
well as all open submitted papers, and their citations. We provide a baseline
for authorship attribution using the dataset, implement a range of
state-of-the-art models in graph representation learning for link prediction,
and discuss the usefulness of our generated knowledge graph structure.
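
As a rough illustration of the metadata-to-graph step described above, the following minimal Python sketch builds a directed heterogeneous graph with two node types (authors and papers) and two edge types (authorship and citation). The record layout, the field names (`paper_id`, `author_ids`, `cites`), the relation names, and the `build_graph` helper are all hypothetical and chosen for illustration only; they are not the HALvest schema or the authors' implementation.

```python
from collections import defaultdict

# Hypothetical metadata records; field names are illustrative, not the HALvest schema.
RECORDS = [
    {"paper_id": "paper-1", "author_ids": ["author-a", "author-b"], "cites": ["paper-2"]},
    {"paper_id": "paper-2", "author_ids": ["author-b"], "cites": []},
    {"paper_id": "paper-3", "author_ids": ["author-c"], "cites": ["paper-1", "paper-2"]},
]


def build_graph(records):
    """Build a directed heterogeneous graph from paper metadata.

    Returns (nodes, edges) where
      nodes maps a node type to a set of node ids, and
      edges maps (source type, relation, target type) to a set of (source, target) pairs.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for rec in records:
        pid = rec["paper_id"]
        nodes["paper"].add(pid)
        # Authorship edges point from an author node to a paper node.
        for aid in rec["author_ids"]:
            nodes["author"].add(aid)
            edges[("author", "writes", "paper")].add((aid, pid))
        # Citation edges point from the citing paper to the cited paper.
        for cited in rec["cites"]:
            nodes["paper"].add(cited)
            edges[("paper", "cites", "paper")].add((pid, cited))
    return dict(nodes), dict(edges)


nodes, edges = build_graph(RECORDS)
```

Keying edges by a (source type, relation, target type) triple keeps the node and edge types explicit, which is the property that makes the graph heterogeneous; link prediction on such a structure amounts to scoring candidate pairs for one of these edge types.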