Etienne van de Bijl, Jan Klein, Joris Pries, Sandjai Bhulai, Mark Hoogendoorn, Rob van der Mei
{"title":"The dutch draw: constructing a universal baseline for binary classification problems","authors":"Etienne van de Bijl, Jan Klein, Joris Pries, Sandjai Bhulai, Mark Hoogendoorn, Rob van der Mei","doi":"10.1017/jpr.2024.52","DOIUrl":null,"url":null,"abstract":"<p>Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an <span><span><img data-mimesubtype=\"png\" data-type=\"\" src=\"https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20240918134025706-0265:S0021900224000524:S0021900224000524_inline1.png\"><span data-mathjax-type=\"texmath\"><span>$F_1$</span></span></img></span></span> of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, being state-of-the-art is dynamic, as newer models are continuously developed. Contrary to an advanced model, it is also possible to use a simple dummy classifier. However, the latter model could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to get a reliable expected performance, which could be computationally expensive. We present a universal baseline method for all <span>binary classification</span> models, named the <span>Dutch Draw</span> (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. Summarizing, the DD baseline is <span>general</span>, as it is applicable to any binary classification problem; <span>simple</span>, as it can be quickly determined without training or parameter tuning; and <span>informative</span>, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.</p>","PeriodicalId":50256,"journal":{"name":"Journal of Applied Probability","volume":null,"pages":null},"PeriodicalIF":0.7000,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Applied Probability","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1017/jpr.2024.52","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
引用次数: 0
Abstract
Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an $F_1$ of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, being state-of-the-art is dynamic, as newer models are continuously developed. Contrary to an advanced model, it is also possible to use a simple dummy classifier. However, the latter model could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to get a reliable expected performance, which could be computationally expensive. We present a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. Summarizing, the DD baseline is general, as it is applicable to any binary classification problem; simple, as it can be quickly determined without training or parameter tuning; and informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.
期刊介绍:
Journal of Applied Probability is the oldest journal devoted to the publication of research in the field of applied probability. It is an international journal published by the Applied Probability Trust, and it serves as a companion publication to the Advances in Applied Probability. Its wide audience includes leading researchers across the entire spectrum of applied probability, including biosciences applications, operations research, telecommunications, computer science, engineering, epidemiology, financial mathematics, the physical and social sciences, and any field where stochastic modeling is used.
A submission to Applied Probability represents a submission that may, at the Editor-in-Chief’s discretion, appear in either the Journal of Applied Probability or the Advances in Applied Probability. Typically, shorter papers appear in the Journal, with longer contributions appearing in the Advances.