Callico: a Versatile Open-Source Document Image Annotation Platform
Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie
arXiv - CS - Digital Libraries, 2024-05-02, arXiv:2405.01071
Abstract
This paper presents Callico, a web-based open-source platform designed to
simplify the annotation process in document recognition projects. The move
towards data-centric AI in machine learning and deep learning underscores the
importance of high-quality data, and the need for specialised tools that
increase the efficiency and effectiveness of generating such data. For document
image annotation, Callico offers dual-display annotation for digitised
documents, enabling simultaneous visualisation and annotation of scanned images
and text. This capability is critical for OCR and HTR model training, document
layout analysis, named entity recognition, form-based key-value annotation and
hierarchical structure annotation with element grouping. The platform supports
collaborative annotation with versatile features backed by a commitment to
open-source development, high-quality code standards and easy deployment via
Docker.
Illustrative use cases demonstrate Callico's applicability and utility,
including the transcription of the Belfort municipal registers, the indexing of
French World War II prisoners for the ICRC, and the extraction of personal
information from the Socface project's census lists.