Title: ZASCA-sum: A dataset of the South Africa supreme courts of appeal judgments and media summaries for legal documents summarization research
Authors: Idris Abdulmumin, Vukosi Marivate
Journal: Data in Brief, volume 60, Article 111567
Publication date: 2025-04-19
Publication type: Journal Article
DOI: 10.1016/j.dib.2025.111567
URL: https://www.sciencedirect.com/science/article/pii/S2352340925002999
Impact factor: 1.0 (JCR Q3, Multidisciplinary Sciences)
Citation count: 0
Abstract
This paper presents ZASCA-Sum, a novel dataset comprising judgments from the South African Supreme Court of Appeal and their manually curated media summaries. The dataset, collected from the court's official website, includes 4171 judgments, of which 2118 are paired with a media summary. The judgments and summaries have been extracted and prepared to support legal document summarization in supervised, semi-supervised, and unsupervised settings. This paper describes the dataset in detail, covering the data collection process, timeline, processing, and potential applications. We report the token-count distributions of the judgments and summaries and analyze which of them can be accommodated off-the-shelf by the current summarization models with the largest input token sizes. The dataset, split into training, validation, and test sets, is publicly available to encourage research in legal summarization. Beyond document summarization, researchers can use the data to adapt English-centric models to the South African dialect.
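The token-count analysis mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the whitespace tokenizer stands in for whatever model-specific tokenizer is actually used, the 16384-token limit is a hypothetical input size, and the toy documents are not taken from ZASCA-Sum.

```python
from collections import Counter


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a model-specific tokenizer."""
    return len(text.split())


def fraction_within_limit(texts, max_tokens):
    """Share of texts whose token count does not exceed max_tokens."""
    if not texts:
        return 0.0
    fits = sum(1 for t in texts if count_tokens(t) <= max_tokens)
    return fits / len(texts)


def length_histogram(texts, bin_size):
    """Map each bin's lower bound to the number of texts falling in that bin."""
    bins = Counter((count_tokens(t) // bin_size) * bin_size for t in texts)
    return dict(sorted(bins.items()))


# Hypothetical toy documents of 50, 1200, and 20000 tokens.
judgments = ["word " * 50, "word " * 1200, "word " * 20000]

# Hypothetical input limit; substitute the limit of the model under study.
print(fraction_within_limit(judgments, max_tokens=16384))
print(length_histogram(judgments, bin_size=1000))
```

Swapping `count_tokens` for a real subword tokenizer would generally raise the counts, so the fraction of documents that fit a given model is likely to be lower than a whitespace estimate suggests.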
About the journal:
Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.