ZASCA-sum:南非最高法院上诉判决和媒体摘要数据集,用于法律文件摘要研究

IF 1 Q3 MULTIDISCIPLINARY SCIENCES
Idris Abdulmumin, Vukosi Marivate
{"title":"ZASCA-sum:南非最高法院上诉判决和媒体摘要数据集,用于法律文件摘要研究","authors":"Idris Abdulmumin,&nbsp;Vukosi Marivate","doi":"10.1016/j.dib.2025.111567","DOIUrl":null,"url":null,"abstract":"<div><div>This paper presents ZASCA-Sum, a novel dataset comprising judgments from the South Africa Supreme Court of Appeal and their manually curated media summaries. The dataset, collected from the court's official website, includes 4171 judgments, of which 2118 have summary pairs. The judgments and summaries have been extracted and prepared to support legal document summarization tasks across supervised, semi-supervised, and unsupervised settings. This paper provides a detailed description of the dataset, covering the data collection process, timeline, processing, and potential applications in the field. We provide the token-count distribution and analysis of the judgments and summaries that can be accommodated off-the-shelf by current summarization models with the largest input token size. The dataset, split into training, validation, and test sets, is made publicly available to encourage research in legal summarization. In addition to document summarization, researchers can use this data to localize English-centric models to support the South African dialect.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"60 ","pages":"Article 111567"},"PeriodicalIF":1.0000,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ZASCA-sum: A dataset of the South Africa supreme courts of appeal judgments and media summaries for legal documents summarization research\",\"authors\":\"Idris Abdulmumin,&nbsp;Vukosi Marivate\",\"doi\":\"10.1016/j.dib.2025.111567\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>This paper presents ZASCA-Sum, a novel dataset comprising judgments from the South Africa Supreme Court of Appeal and their manually curated media summaries. The dataset, collected from the court's official website, includes 4171 judgments, of which 2118 have summary pairs. The judgments and summaries have been extracted and prepared to support legal document summarization tasks across supervised, semi-supervised, and unsupervised settings. This paper provides a detailed description of the dataset, covering the data collection process, timeline, processing, and potential applications in the field. We provide the token-count distribution and analysis of the judgments and summaries that can be accommodated off-the-shelf by current summarization models with the largest input token size. The dataset, split into training, validation, and test sets, is made publicly available to encourage research in legal summarization. In addition to document summarization, researchers can use this data to localize English-centric models to support the South African dialect.</div></div>\",\"PeriodicalId\":10973,\"journal\":{\"name\":\"Data in Brief\",\"volume\":\"60 \",\"pages\":\"Article 111567\"},\"PeriodicalIF\":1.0000,\"publicationDate\":\"2025-04-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data in Brief\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2352340925002999\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data in Brief","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2352340925002999","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

摘要

本文介绍了ZASCA-Sum,这是一个新颖的数据集,包括南非最高上诉法院的判决及其人工策划的媒体摘要。该数据集来自法院的官方网站,包括4171份判决,其中2118份是简易判决。已经提取并准备了判决和摘要,以支持监督、半监督和无监督环境下的法律文件摘要任务。本文提供了数据集的详细描述,包括数据收集过程、时间线、处理和在该领域的潜在应用。我们提供了令牌计数分布和判断和摘要的分析,这些判断和摘要可以由具有最大输入令牌大小的当前摘要模型现成地容纳。该数据集分为训练集、验证集和测试集,公开提供,以鼓励法律摘要方面的研究。除了文档摘要之外,研究人员还可以使用这些数据来本地化以英语为中心的模型,以支持南非方言。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
ZASCA-sum: A dataset of the South Africa supreme courts of appeal judgments and media summaries for legal documents summarization research
This paper presents ZASCA-Sum, a novel dataset comprising judgments from the South Africa Supreme Court of Appeal and their manually curated media summaries. The dataset, collected from the court's official website, includes 4171 judgments, of which 2118 have summary pairs. The judgments and summaries have been extracted and prepared to support legal document summarization tasks across supervised, semi-supervised, and unsupervised settings. This paper provides a detailed description of the dataset, covering the data collection process, timeline, processing, and potential applications in the field. We provide the token-count distribution and analysis of the judgments and summaries that can be accommodated off-the-shelf by current summarization models with the largest input token size. The dataset, split into training, validation, and test sets, is made publicly available to encourage research in legal summarization. In addition to document summarization, researchers can use this data to localize English-centric models to support the South African dialect.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Data in Brief
Data in Brief MULTIDISCIPLINARY SCIENCES-
CiteScore
3.10
自引率
0.00%
发文量
996
审稿时长
70 days
期刊介绍: Data in Brief provides a way for researchers to easily share and reuse each other''s datasets by publishing data articles that: -Thoroughly describe your data, facilitating reproducibility. -Make your data, which is often buried in supplementary material, easier to find. -Increase traffic towards associated research articles and data, leading to more citations. -Open up doors for new collaborations. Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信