OBCTeacher: Resisting labeled data scarcity in oracle bone character detection by semi-supervised learning
Xiuan Wan, Zhengchen Li, Dandan Liang, Shouyong Pan, Yuchun Fang
Information Processing & Management (Impact Factor 7.4; JCR Q1, Computer Science, Information Systems), published 2024-08-07
DOI: 10.1016/j.ipm.2024.103864
https://www.sciencedirect.com/science/article/pii/S0306457324002231
Citation count: 0
Abstract
Oracle bone characters (OBCs) are ancient ideographs used for divination and memorization, and they are first-hand evidence of ancient Chinese culture. OBC detection is a prerequisite for advanced studies and was mainly done by authoritative experts in the past. Deep learning techniques have great potential to facilitate OBC detection, but the high cost of annotating OBCs makes labeled data scarce, which hinders their application. This paper proposes OBCTeacher, a novel OBC detection framework based on semi-supervised learning (SSL), to resist labeled data scarcity. We first construct a large-scale OBC detection dataset. Through investigation, we find that spatial mismatching and class imbalance lead to fewer positive anchors and biased predictions, which degrade the quality of pseudo labels and the performance of OBC detection. To mitigate spatial mismatching, we introduce a geometric-priori-based anchor assignment strategy and a heatmap polishing procedure that increase the number of positive anchors and improve the quality of pseudo labels. To address class imbalance, we propose a re-weighting method based on estimated class information and a contrastive anchor loss, which together achieve prioritized learning across OBC classes and better class boundaries. We evaluate our method in two settings: using only a small portion of the labeled data with the remaining data treated as unlabeled, and using all labeled data together with extra unlabeled data. The results demonstrate the effectiveness of our method: it outperforms other state-of-the-art methods and improves $AP_{50:95}$ by an average of 11.97 over the supervised-only baseline. In addition, using only 20% of the labeled data, our method achieves performance comparable to the fully supervised baseline trained on 100% of the labeled data, demonstrating that it significantly reduces the dependence on labeled data for OBC detection.
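The abstract does not say how pseudo labels are selected or how class information is estimated, so the following is only a generic sketch of the two ideas it names: filtering teacher predictions into pseudo labels, and re-weighting classes from estimated class counts. The confidence threshold, the inverse-frequency formula, the smoothing term, and every function and field name here are assumptions made for illustration, not the OBCTeacher implementation.

```python
from collections import Counter


def select_pseudo_labels(predictions, score_threshold=0.7):
    """Keep teacher predictions whose confidence meets the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return [p for p in predictions if p["score"] >= score_threshold]


def class_weights(pseudo_labels, num_classes, smoothing=1.0):
    """Inverse-frequency class weights, normalized so their mean is 1.

    Class counts are estimated from the pseudo labels; `smoothing`
    avoids division by zero for classes that never appear.
    """
    counts = Counter(p["label"] for p in pseudo_labels)
    raw = [1.0 / (counts.get(c, 0) + smoothing) for c in range(num_classes)]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]


# Toy teacher output: class 0 is frequent, class 2 is never confidently detected.
teacher_predictions = [
    {"label": 0, "score": 0.95},
    {"label": 0, "score": 0.90},
    {"label": 0, "score": 0.85},
    {"label": 1, "score": 0.80},
    {"label": 1, "score": 0.40},  # below threshold, dropped
    {"label": 2, "score": 0.30},  # below threshold, dropped
]

pseudo = select_pseudo_labels(teacher_predictions)
weights = class_weights(pseudo, num_classes=3)
print(len(pseudo), [round(w, 3) for w in weights])
```

In this toy setup, rarer classes receive larger weights, which is one simple way to prioritize learning on under-represented classes; the paper's actual re-weighting scheme may differ.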
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.