Response to Comment on ‘Automatic identification of streamlined subglacial bedforms using machine learning: an open-source Python approach’

Marion McKenzie, Ellianna Abrahams, Fernando Pérez, Ryan Venturelli
{"title":"Response to Comment on ‘Automatic identification of streamlined subglacial bedforms using machine learning: an open-source Python approach’","authors":"Marion McKenzie, Ellianna Abrahams, Fernando Pérez, Ryan Venturelli","doi":"10.1111/bor.70004","DOIUrl":null,"url":null,"abstract":"<p>Li <i>et al</i>. (<span>2025</span> this issue) state that they identify areas for improvement to our development of bedfinder through including more data sets at training, evaluating our filtering methods, and exploring modular approaches of the tool. Here we respond to these Comments by highlighting where we have already addressed each of these areas within our work and notably, our supporting information (Abrahams <i>et al</i>. <span>2024</span>). In our paper, we describe that bedfinder is an inherently modular tool, allowing a user to choose which components of the pipeline might be useful to them. Furthermore, bedfinder already allows a user to customize the choice to over- or under-predict a glacially derived bedform assignment. In Abrahams <i>et al</i>. (<span>2024</span>) we previously justified why we made the choice to over-predict, emphasizing the need for manual post-processing, which we reiterate here. Finally, we reshare statements from the ‘Model limitations’ section of Abrahams <i>et al</i>. (<span>2024</span>) where we also recommend incorporating additional data in future tool creation to strengthen our approach.</p><p>The objective of Abrahams <i>et al</i>. (<span>2024</span>) was to build an open-source tool that would allow for the automatic detection of glacially derived streamlined subglacial bedforms, based on the previous successes of manual approaches (e.g. Clark <span>1993</span>; Greenwood & Clark <span>2008</span>; Spagnolo <i>et al</i>. <span>2014</span>; Ely <i>et al</i>. <span>2016</span>; Principato <i>et al</i>. <span>2016</span>; Clark <i>et al</i>. <span>2018</span>). To develop this tool, we used Random Forest (Breiman <span>2001</span>), XGBoost (Chen & Guestrin <span>2016</span>), and an ensemble average of these two model fits on a publicly available training data set of nearly 600 000 data points across the deglaciated Northern Hemisphere (McKenzie <i>et al</i>. <span>2022</span>). Li <i>et al</i>. (<span>2025</span>) suggest a constructive critique to our approach stating our work should include more data sets at training, further evaluate our filtering methods, and explore better modularizing bedfinder. However, these suggestions have either already been implemented as existing features of bedfinder (i.e. modularity of pipeline components and tunability of bedform predictions to the needs of the user) or we have already named them in our study as known limitations of the tool that will require wider community participation (the creation of larger, more nuanced machine learning data sets for model training).</p><p>Abrahams <i>et al</i>. (<span>2024</span>) outlined the limitations of the presented approach in the section ‘Model limitations’. There, we state that the ‘TPI tool used to compile our training set performs most poorly in regions with highly elongate bedforms with low surface relief (McKenzie <i>et al</i>. <span>2022</span>)’ (i.e. bedforms across crystalline bedrock surfaces) (Abrahams <i>et al</i>. <span>2024</span>), which Li <i>et al</i>. (<span>2025</span>) also note as a limitation in their Comments. Further comments in Li <i>et al</i>. 
(<span>2025</span>) reflect the performance and strengths of the TPI tool, which is not the focus of Abrahams <i>et al</i>. (<span>2024</span>) and is instead the focus of McKenzie <i>et al</i>. (<span>2022</span>). Even so, justification for the choice to make binary classifications of the data is explained within the ‘Model limitations’ section of Abrahams <i>et al</i>. (<span>2024</span>). The solution Li <i>et al</i>. (<span>2025</span>) present to overcome the training data set limitations is one that was explicitly outlined within the ‘Future directions for tool advancements’ section in Abrahams <i>et al</i>. (<span>2024</span>). Almost identically to what Li <i>et al</i>. (<span>2025</span>) present as their solution to this problem, we express within our original work that ‘future incorporation of additional training data that increase representation of low relief and multi-directional ice flow’ will help overcome these deficiencies in bedfinder (Abrahams <i>et al</i>. <span>2024</span>).</p><p>In their Comments, Li <i>et al</i>. (<span>2025</span>) claim that the performance metrics used in Abrahams <i>et al</i>. (<span>2024</span>) obfuscate true performance given the inherent class imbalance within the training data, despite the fact that this exact concern is the main focus of the Supporting Information Data S2. There we highlight our focus on the F1 score as a preferred metric for model selection in the case of class imbalance because its usage ‘prevents an overly optimistic assessment’ and is ‘balanced towards a low rate of false positives and false negatives’ (Abrahams <i>et al</i>. <span>2024</span>). We direct the interested reader to He and Garcia (<span>2009</span>) to read more about the usefulness of the F1 score in the case of class imbalance (Flach & Kull <span>2015</span> is another excellent resource).</p><p>We would like to emphasize, as should be implicitly understood, that the performance metrics in Table 2 of Abrahams <i>et al</i>. (<span>2024</span>) are only achievable on tests within the distribution of the training set. To investigate potential distribution shift, we validate bedfinder on an out-of-distribution (OOD) sample, as is detailed in the section titled ‘Model validation through a subset of Green Bay Lobe bedforms’, finding that bedfinder, as it is tuned within the paper, recovers >79% of the true positives there. Li <i>et al</i>. (<span>2025</span>) make the recommendation to withhold individual regions as test data, and we are sure that they will be happy to see that this analysis is already completed in that section of Abrahams <i>et al</i>. (<span>2024</span>). In the paper, we shared the accuracy achieved along with the ROC curves for the new region, and in our publicly available accompanying materials (Abrahams & McKenzie <span>2024</span>) we shared recall, precision, and F1 score. We take this opportunity to reiterate that ‘the choice of sites for the training data set limits the model's ability to extrapolate results to new regions with topographic constraints or bedrock types that are out-of-distribution (OOD). Any applications of the tools in this paper to OOD data are statistically unreliable but will still provide a starting point to analysing presence of streamlined subglacial bedforms across a deglaciated landscape’ (Abrahams <i>et al</i>. 
<span>2024</span>).</p><p>As stated, bedfinder is not intended as the only solution, but rather a starting point for scientific practitioners hoping to automate some of the process of describing landforms within a deglaciated landscape. In claiming that Abrahams <i>et al</i>. (<span>2024</span>) misrepresent the implications of tuning our model approach towards overestimating false positives (which leads to the inclusion of mistaken detections) instead of tuning towards overestimating false negatives (which leads to the exclusion of definite bedforms at the distribution's edge), Li <i>et al</i>. (<span>2025</span>) seem to suggest that users would indiscriminately apply this tool without critically evaluating its outcomes, <i>which would explicitly go against the usage recommendations outlined within our paper and documentation</i>. Not only does our paper make the model tuning towards over-prediction clear, we recommend the reader to run inferences from all three model fits and compare the results within the paper on any new data to aid in manual post-processing. Furthermore, in the publicly available tool documentation we indicate how our tool can be used by the reader to tune towards overestimating false negatives should they wish (Abrahams & McKenzie <span>2024</span>).</p><p>In a further claim, Li <i>et al</i>. (<span>2025</span>) state that Abrahams <i>et al</i>. (<span>2024</span>) have not ‘sufficiently explored’ the tradeoff between false positives and false negatives. We disagree, as this tradeoff is established to be well described by the F1 score, which is commonly used to assess performance in the presence of class imbalance (He & Garcia <span>2009</span>; Flach & Kull <span>2015</span>, among others), as described above and shared in the Supporting Information Data S2, where we state ‘for this reason, we primarily focus on F1 score throughout this work’. Furthermore, the ROC curves shown in Fig. 4A (Abrahams <i>et al</i>. <span>2024</span>) explore this argument. Li <i>et al</i>. (<span>2025</span>) also suggest that we should have utilized a precision-recall curve (precision vs. recall) instead of an ROC curve (true positive rate vs. false positive rate) in making our model selection; however, as is highlighted in Flach & Kull (<span>2015</span>) and elsewhere, precision-recall curves can be misleading in cases of class imbalance where, in contrast, it has been widely established that ROC curves are not sensitive to class ratio (Fawcett <span>2006</span>) and therefore a more stable approach. Recently, McDermott <i>et al</i>. (<span>2024</span>) demonstrated that the area under the precision-recall curve is an explicitly biased and discriminatory metric for model selection, and recommended relying on the area under an ROC curve in cases where minimizing false negatives is more important than minimizing false positives, which are the stated needs of our paper.</p><p>Li <i>et al</i>. (<span>2025</span>) disagree with our choice to bias our model fit towards false positives; however, in any ML model there is a tuning choice between biasing the model towards overprediction (more false positives) or underprediction (more false negatives) which are inherently in conflict in imbalanced data sets (Chawla <span>2005</span>). Since we know that post-processing is a viable alternative for any user, and indeed, we recommend this to any practitioner who implements our tool, we choose to tune towards overprediction in order not to miss any true positives. 
The model output needs to be manually assessed for accuracy, but starting from a data set with a fraction of false positives is preferable for glacial geomorphologists rather than revisiting the raw elevation data to identify missed true positives. If a user, like Li <i>et al</i>. (<span>2025</span>) with their preference towards underprediction, prefers another tuning however, bedfinder (Abrahams & McKenzie <span>2024</span>) can be implemented <i>as is</i> to select a stricter probability threshold in its prediction and therefore be tuned towards false positives. Figure 4B in Abrahams <i>et al</i>. (<span>2024</span>) illustrates the tradeoff between precision and recall as this probability threshold is tuned.</p><p>We are surprised to see that Li <i>et al</i>. (<span>2025</span>) take issue with our choice to filter the data as a preprocessing step in our pipeline, as resampling is a well-established method for preparing imbalanced data for use with ML (Chawla <span>2005</span>; He & Garcia <span>2009</span>, among others). He and Garcia (<span>2009</span>) in ‘Learning from imbalanced data’, recommend undersampling the majority class rather than oversampling to avoid overfitting, and broadly divide this approach into two categories: ‘random undersampling’ and ‘informed undersampling’. Not only did we preliminarily train and test the statistical model on the unfiltered data set (see links to the GitHub repository containing the accompanying analysis files in the Data Availability statement of our paper, Abrahams & McKenzie <span>2024</span>), we also trained and tested the statistical model on random undersampling as defined by He and Garcia (<span>2009</span>), as is shown in Fig. 3 of Abrahams <i>et al</i>. (<span>2024</span>). We found that the model was only able to recover 22% of the true positives without any filtering (Fig. 3 of Abrahams <i>et al</i>. <span>2024</span> shows that random undersampling only recovers 43% of the true positives). These model fits – trained on unfiltered data and on data where the non-glacially derived landforms were randomly undersampled – were biased towards underidentifying true glacially derived bedforms. In both cases, flipping a fair coin for class assignment would provide better recall than the model fits. (Of course, a coin flip would also provide worse precision, since in both of these cases the model fits are biased towards missing true glacially derived bedforms and would therefore assign negatives (both correctly and incorrectly) more often than a fair coin flip would. A coin flip approach would therefore not alleviate the need for manual assignment on the entire input data set any more than models trained on unfiltered or unusefully filtered data, but does illustrate the need for informed filtering in this case). As requested by Li <i>et al</i>. (<span>2025</span>) we reproduce the confusion matrices for these investigations here in Fig. 1, which illustrate how the majority of bedforms are missed while using unfiltered or unusefully filtered data.</p><p>He & Garcia (<span>2009</span>) offer several examples of informed undersampling algorithms, including Near Miss, all of which implement purely statistical approaches to filtering the data, often requiring advanced knowledge of class assignment. Ultimately, after testing several undersampling approaches (see the analysis files accompanying our paper), we settled on Near Miss as it provided the most generalizable results on a withheld test set. Li <i>et al</i>. 
(<span>2025</span>) state that our approach is missing these ablation studies; however, we tested a variety of filtering approaches, including data cleaning techniques, before finalizing the choice of Near Miss, and all of these ablation studies are available in our accompanying publicly available GitHub repository (Abrahams & McKenzie <span>2024</span>).</p><p>As mentioned above, Near Miss and other informed undersampling algorithms (e.g. He & Garcia <span>2009</span>) rely on statistical filtering approaches to create greater distance between classes within the data. However, when a scientific expert manually labels bedforms, often with a prefilter approach to remove obviously spurious detections, the technique informing this filter is scientifically motivated rather than statistically motivated. We believe that Abrahams <i>et al</i>. (<span>2024</span>) speaks for itself in the power of combining these two approaches. Li <i>et al</i>. (<span>2025</span>) claim that the choices that led to our filtering approach are not transparent, but we would like to remind the reader that the opposite is true: our filtering approach is clearly outlined and probed in the ‘Filtering the non-glacial features to balance classes’ section of the paper and we make these filtering options easily available to the reader as the <i>filtering</i> function in bedfinder, and in the accompanying GitHub repo (Abrahams & McKenzie <span>2024</span>). These improvements in transparency of decision making are a large improvement to the ‘[…] manual classification choices [that vary] from expert to expert [that] can lead to difficult to reproduce, subjective classification schema with associated unquantified error’ (Abrahams <i>et al</i>. <span>2024</span>). We would also like to emphasize that Near Miss <i>is itself</i> a ‘data-driven filtering technique’. In driving an approach that combines statistical filtering with scientific filtering, we have already created an implementation that allows the model to adjust to new data without overfitting. The power of this combined approach allows scientific information to guide the model in a way that can still be generalized to new regions.</p><p>We agree with the astute observation from Li <i>et al</i>. (<span>2025</span>), raised first in Abrahams <i>et al</i>. (<span>2024</span>), that addressing the need for expanded publicly available training sets will refine the model's ability to perform in OOD regions. To showcase our initial argument in support of open science and reproducibility, in writing this response, the authors were able to address all of Li <i>et al</i>.'s (<span>2025</span>) concerns without the need to conduct any new analyses and by providing references to the already existing, fully available data and results from their initial study (Abrahams & McKenzie <span>2024</span>). We thank the authors of the Comments for their support in helping us illustrate this point. 
By incorporating existing glacial geomorphology training data into the findable, accessible, interoperable, and reproducible (FAIR; Wilkinson <span>2016</span>) data framework, the potential of bedfinder and other powerful geospatial modelling tools to be trained on well-rounded and robust data sets will be strengthened.</p>","PeriodicalId":9184,"journal":{"name":"Boreas","volume":"54 2","pages":"277-280"},"PeriodicalIF":2.4000,"publicationDate":"2025-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/bor.70004","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Boreas","FirstCategoryId":"89","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bor.70004","RegionNum":3,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"GEOGRAPHY, PHYSICAL","Score":null,"Total":0}
Abstract
Li et al. (2025, this issue) identify areas for improvement in our development of bedfinder: including more data sets in training, evaluating our filtering methods, and exploring modular approaches to the tool. Here we respond to these Comments by highlighting where we have already addressed each of these areas within our work and, notably, our supporting information (Abrahams et al. 2024). In our paper, we describe bedfinder as an inherently modular tool that allows a user to choose which components of the pipeline might be useful to them. Furthermore, bedfinder already allows a user to customize the choice to over- or under-predict a glacially derived bedform assignment. In Abrahams et al. (2024) we justified our choice to over-predict and emphasized the need for manual post-processing, which we reiterate here. Finally, we restate points from the ‘Model limitations’ section of Abrahams et al. (2024), where we also recommend incorporating additional data into future tool creation to strengthen our approach.
The objective of Abrahams et al. (2024) was to build an open-source tool for the automatic detection of glacially derived streamlined subglacial bedforms, building on the previous successes of manual approaches (e.g. Clark 1993; Greenwood & Clark 2008; Spagnolo et al. 2014; Ely et al. 2016; Principato et al. 2016; Clark et al. 2018). To develop this tool, we used Random Forest (Breiman 2001), XGBoost (Chen & Guestrin 2016), and an ensemble average of these two model fits, trained on a publicly available data set of nearly 600 000 data points across the deglaciated Northern Hemisphere (McKenzie et al. 2022). Li et al. (2025) offer a constructive critique of our approach, stating that our work should include more data sets in training, further evaluate our filtering methods, and explore further modularizing bedfinder. However, these suggestions have either already been implemented as existing features of bedfinder (i.e. modularity of pipeline components and tunability of bedform predictions to the needs of the user) or have already been named in our study as known limitations of the tool that will require wider community participation (the creation of larger, more nuanced machine learning data sets for model training).
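For readers less familiar with this kind of ensembling, the sketch below shows the general pattern of averaging the predicted probabilities of a Random Forest fit and an XGBoost fit; the synthetic data and hyperparameters are illustrative only and are not those used in bedfinder.

```python
# Minimal sketch of averaging two model fits (illustrative data and
# hyperparameters, not the configuration used in bedfinder).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Stand-in for a table of per-landform morphometrics (y = 1: glacially derived)
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
xgb = XGBClassifier(n_estimators=500, eval_metric="logloss", random_state=0).fit(X_tr, y_tr)

# Ensemble average: mean of the two models' predicted positive-class probabilities
p_ens = (rf.predict_proba(X_te)[:, 1] + xgb.predict_proba(X_te)[:, 1]) / 2
y_pred = (p_ens >= 0.5).astype(int)
```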
Abrahams et al. (2024) outlined the limitations of the presented approach in the section ‘Model limitations’. There, we state that the ‘TPI tool used to compile our training set performs most poorly in regions with highly elongate bedforms with low surface relief (McKenzie et al. 2022)’ (i.e. bedforms across crystalline bedrock surfaces) (Abrahams et al. 2024), which Li et al. (2025) also note as a limitation in their Comments. Further comments in Li et al. (2025) concern the performance and strengths of the TPI tool, which is not the focus of Abrahams et al. (2024) but rather of McKenzie et al. (2022). Even so, the justification for the choice to make binary classifications of the data is explained within the ‘Model limitations’ section of Abrahams et al. (2024). The solution Li et al. (2025) present to overcome the training data set limitations is one that was explicitly outlined within the ‘Future directions for tool advancements’ section in Abrahams et al. (2024). Almost identically to the solution Li et al. (2025) present for this problem, we state in our original work that ‘future incorporation of additional training data that increase representation of low relief and multi-directional ice flow’ will help overcome these deficiencies in bedfinder (Abrahams et al. 2024).
In their Comments, Li et al. (2025) claim that the performance metrics used in Abrahams et al. (2024) obfuscate true performance given the inherent class imbalance within the training data, despite the fact that this exact concern is the main focus of the Supporting Information Data S2. There we highlight our focus on the F1 score as a preferred metric for model selection in the case of class imbalance because its usage ‘prevents an overly optimistic assessment’ and is ‘balanced towards a low rate of false positives and false negatives’ (Abrahams et al. 2024). We direct the interested reader to He and Garcia (2009) to read more about the usefulness of the F1 score in the case of class imbalance (Flach & Kull 2015 is another excellent resource).
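As a concrete, toy illustration of this point (synthetic numbers, not drawn from our training set), a classifier that trivially predicts the majority class on a heavily imbalanced sample looks excellent by accuracy yet scores an F1 of zero:

```python
# Why accuracy flatters imbalanced data while the F1 score does not
# (toy numbers, not the McKenzie et al. 2022 training set).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1] * 50 + [0] * 950)   # 5% positive class
y_all_negative = np.zeros_like(y_true)    # trivially predict the majority class

print(accuracy_score(y_true, y_all_negative))            # 0.95 - overly optimistic
print(f1_score(y_true, y_all_negative, zero_division=0)) # 0.0  - no positives recovered
```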
We would like to emphasize, as should be implicitly understood, that the performance metrics in Table 2 of Abrahams et al. (2024) are only achievable on tests within the distribution of the training set. To investigate potential distribution shift, we validate bedfinder on an out-of-distribution (OOD) sample, as is detailed in the section titled ‘Model validation through a subset of Green Bay Lobe bedforms’, finding that bedfinder, as it is tuned within the paper, recovers >79% of the true positives there. Li et al. (2025) make the recommendation to withhold individual regions as test data, and we are sure that they will be happy to see that this analysis is already completed in that section of Abrahams et al. (2024). In the paper, we shared the accuracy achieved along with the ROC curves for the new region, and in our publicly available accompanying materials (Abrahams & McKenzie 2024) we shared recall, precision, and F1 score. We take this opportunity to reiterate that ‘the choice of sites for the training data set limits the model's ability to extrapolate results to new regions with topographic constraints or bedrock types that are out-of-distribution (OOD). Any applications of the tools in this paper to OOD data are statistically unreliable but will still provide a starting point to analysing presence of streamlined subglacial bedforms across a deglaciated landscape’ (Abrahams et al. 2024).
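For readers who wish to reproduce this kind of regional holdout on their own data, the sketch below illustrates the general pattern; the file name, column names, and region label are hypothetical placeholders and are not the exact bedfinder interface.

```python
# Sketch of withholding one region as an out-of-distribution test set
# (the file "bedform_features.csv", the "region"/"label" columns, and the
# label "green_bay_lobe" are hypothetical placeholders).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

df = pd.read_csv("bedform_features.csv")
holdout = df["region"] == "green_bay_lobe"
feature_cols = [c for c in df.columns if c not in ("region", "label")]

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(df.loc[~holdout, feature_cols], df.loc[~holdout, "label"])

# Recall on the withheld region: the fraction of true bedforms recovered there
y_pred = model.predict(df.loc[holdout, feature_cols])
print(recall_score(df.loc[holdout, "label"], y_pred))
```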
As stated, bedfinder is not intended as the only solution, but rather as a starting point for scientific practitioners hoping to automate some of the process of describing landforms within a deglaciated landscape. In claiming that Abrahams et al. (2024) misrepresent the implications of tuning our model approach towards overestimating false positives (which leads to the inclusion of mistaken detections) instead of tuning towards overestimating false negatives (which leads to the exclusion of definite bedforms at the distribution's edge), Li et al. (2025) seem to suggest that users would indiscriminately apply this tool without critically evaluating its outcomes, which would explicitly go against the usage recommendations outlined within our paper and documentation. Not only does our paper make the tuning towards over-prediction clear, we also recommend that the reader run inferences from all three model fits presented in the paper on any new data and compare the results to aid in manual post-processing. Furthermore, in the publicly available tool documentation we indicate how our tool can be used by the reader to tune towards overestimating false negatives should they wish (Abrahams & McKenzie 2024).
In a further claim, Li et al. (2025) state that Abrahams et al. (2024) have not ‘sufficiently explored’ the tradeoff between false positives and false negatives. We disagree, as this tradeoff is well described by the F1 score, which is commonly used to assess performance in the presence of class imbalance (He & Garcia 2009; Flach & Kull 2015, among others), as described above and shared in the Supporting Information Data S2, where we state ‘for this reason, we primarily focus on F1 score throughout this work’. Furthermore, the ROC curves shown in Fig. 4A (Abrahams et al. 2024) explore this argument. Li et al. (2025) also suggest that we should have utilized a precision-recall curve (precision vs. recall) instead of an ROC curve (true positive rate vs. false positive rate) in making our model selection; however, as is highlighted in Flach & Kull (2015) and elsewhere, precision-recall curves can be misleading in cases of class imbalance, whereas it has been widely established that ROC curves are not sensitive to class ratio (Fawcett 2006) and are therefore a more stable approach. Recently, McDermott et al. (2024) demonstrated that the area under the precision-recall curve is an explicitly biased and discriminatory metric for model selection, and recommended relying on the area under an ROC curve in cases where minimizing false negatives is more important than minimizing false positives, which matches the stated needs of our paper.
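A small synthetic illustration of the point about class ratio (toy scores, not our model outputs): when only the number of negatives changes and the score distributions stay fixed, the ROC AUC is essentially unchanged while the area under the precision-recall curve falls as imbalance grows.

```python
# Toy illustration: ROC AUC is insensitive to class ratio, while average
# precision (area under the PR curve) shifts with prevalence.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
scores_pos = rng.normal(1.0, 1.0, 1000)     # scores for true bedforms
scores_neg = rng.normal(0.0, 1.0, 100000)   # scores for non-glacial landforms

for n_neg in (1000, 10000, 100000):         # vary only the class ratio
    y = np.r_[np.ones(1000), np.zeros(n_neg)]
    s = np.r_[scores_pos, scores_neg[:n_neg]]
    print(n_neg, roc_auc_score(y, s), average_precision_score(y, s))
# ROC AUC stays near 0.76 across ratios; average precision drops with imbalance.
```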
Li et al. (2025) disagree with our choice to bias our model fit towards false positives; however, in any ML model there is a tuning choice between biasing the model towards overprediction (more false positives) or underprediction (more false negatives), which are inherently in conflict in imbalanced data sets (Chawla 2005). Since we know that post-processing is a viable alternative for any user, and indeed we recommend it to any practitioner who implements our tool, we choose to tune towards overprediction in order not to miss any true positives. The model output needs to be manually assessed for accuracy, but starting from a data set with a fraction of false positives is preferable for glacial geomorphologists to revisiting the raw elevation data to identify missed true positives. If a user prefers another tuning, however, like Li et al. (2025) with their preference towards underprediction, bedfinder (Abrahams & McKenzie 2024) can be implemented as is to select a stricter probability threshold in its prediction and thereby be tuned towards false negatives instead. Figure 4B in Abrahams et al. (2024) illustrates the tradeoff between precision and recall as this probability threshold is tuned.
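The general mechanism is simply a threshold applied to predicted probabilities; the sketch below (synthetic data and a generic classifier, not the bedfinder interface) shows how raising the threshold trades recall for precision.

```python
# Generic sketch of probability-threshold tuning (synthetic data, not the
# bedfinder API): stricter thresholds under-predict, looser ones over-predict.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          precision_score(y_te, y_pred, zero_division=0),  # rises with stricter thresholds
          recall_score(y_te, y_pred))                      # falls with stricter thresholds
```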
We are surprised to see that Li et al. (2025) take issue with our choice to filter the data as a preprocessing step in our pipeline, as resampling is a well-established method for preparing imbalanced data for use with ML (Chawla 2005; He & Garcia 2009, among others). He and Garcia (2009), in ‘Learning from imbalanced data’, recommend undersampling the majority class rather than oversampling to avoid overfitting, and broadly divide this approach into two categories: ‘random undersampling’ and ‘informed undersampling’. Not only did we preliminarily train and test the statistical model on the unfiltered data set (see links to the GitHub repository containing the accompanying analysis files in the Data Availability statement of our paper, Abrahams & McKenzie 2024), we also trained and tested the statistical model on randomly undersampled data as defined by He and Garcia (2009), as is shown in Fig. 3 of Abrahams et al. (2024). We found that the model was only able to recover 22% of the true positives without any filtering (Fig. 3 of Abrahams et al. 2024 shows that random undersampling only recovers 43% of the true positives). These model fits – trained on unfiltered data and on data where the non-glacially derived landforms were randomly undersampled – were biased towards under-identifying true glacially derived bedforms. In both cases, flipping a fair coin for class assignment would provide better recall than the model fits. (Of course, a coin flip would also provide worse precision: because both model fits are biased towards missing true glacially derived bedforms, they assign negatives, both correctly and incorrectly, more often than a fair coin flip would. A coin-flip approach would therefore not alleviate the need for manual assignment on the entire input data set any more than models trained on unfiltered or uninformatively filtered data, but it does illustrate the need for informed filtering in this case.) As requested by Li et al. (2025), we reproduce the confusion matrices for these investigations here in Fig. 1, which illustrate how the majority of bedforms are missed when using unfiltered or uninformatively filtered data.
He & Garcia (2009) offer several examples of informed undersampling algorithms, including Near Miss, all of which implement purely statistical approaches to filtering the data, often requiring advanced knowledge of class assignment. Ultimately, after testing several undersampling approaches (see the analysis files accompanying our paper), we settled on Near Miss as it provided the most generalizable results on a withheld test set. Li et al. (2025) state that our approach is missing these ablation studies; however, we tested a variety of filtering approaches, including data cleaning techniques, before finalizing the choice of Near Miss, and all of these ablation studies are available in our accompanying publicly available GitHub repository (Abrahams & McKenzie 2024).
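For orientation, the sketch below contrasts random undersampling with the informed Near Miss strategy using the imbalanced-learn package on synthetic data; it is illustrative only and does not reproduce the exact configuration used in our analysis files.

```python
# Sketch comparing random and informed (Near Miss) undersampling with
# imbalanced-learn; synthetic data, not our training set.
from collections import Counter
from imblearn.under_sampling import NearMiss, RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=10, weights=[0.95], random_state=0)
print(Counter(y))  # heavily imbalanced input

X_rand, y_rand = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)  # informed: keeps majority points near the minority class
print(Counter(y_rand), Counter(y_nm))                # both balanced, by different rules
```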
As mentioned above, Near Miss and other informed undersampling algorithms (e.g. He & Garcia 2009) rely on statistical filtering approaches to create greater distance between classes within the data. However, when a scientific expert manually labels bedforms, often with a prefilter approach to remove obviously spurious detections, the technique informing this filter is scientifically motivated rather than statistically motivated. We believe that Abrahams et al. (2024) speaks for itself in demonstrating the power of combining these two approaches. Li et al. (2025) claim that the choices that led to our filtering approach are not transparent, but we would like to remind the reader that the opposite is true: our filtering approach is clearly outlined and probed in the ‘Filtering the non-glacial features to balance classes’ section of the paper, and we make these filtering options easily available to the reader as the filtering function in bedfinder and in the accompanying GitHub repo (Abrahams & McKenzie 2024). This transparency in decision making is a large improvement over the ‘[…] manual classification choices [that vary] from expert to expert [that] can lead to difficult to reproduce, subjective classification schema with associated unquantified error’ (Abrahams et al. 2024). We would also like to emphasize that Near Miss is itself a ‘data-driven filtering technique’. By combining statistical filtering with scientific filtering, we have already created an implementation that allows the model to adjust to new data without overfitting. The power of this combined approach allows scientific information to guide the model in a way that can still be generalized to new regions.
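Schematically, the combined approach looks like the sketch below: an expert, scientifically motivated prefilter followed by statistical undersampling. The file name, column names, and threshold are hypothetical placeholders and are not the criteria implemented in bedfinder's filtering function.

```python
# Schematic of combining a scientifically motivated prefilter with a
# statistical undersampler; the expert rule below is a hypothetical
# placeholder, not the criteria used in bedfinder.
import pandas as pd
from imblearn.under_sampling import NearMiss

df = pd.read_csv("candidate_landforms.csv")  # hypothetical labelled table

# Stage 1: expert rule removes obviously spurious detections
# (placeholder example: discard candidates shorter than 100 m)
df = df[df["length_m"] >= 100]

# Stage 2: statistical undersampling of the remaining majority class
feature_cols = [c for c in df.columns if c != "label"]
X_bal, y_bal = NearMiss(version=1).fit_resample(df[feature_cols], df["label"])
```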
We agree with the astute observation from Li et al. (2025), raised first in Abrahams et al. (2024), that addressing the need for expanded publicly available training sets will refine the model's ability to perform in OOD regions. To showcase our initial argument in support of open science and reproducibility: in writing this response, we were able to address all of Li et al.'s (2025) concerns without conducting any new analyses, simply by providing references to the already existing, fully available data and results from our initial study (Abrahams & McKenzie 2024). We thank the authors of the Comments for their support in helping us illustrate this point. By incorporating existing glacial geomorphology training data into the findable, accessible, interoperable, and reusable (FAIR; Wilkinson et al. 2016) data framework, the potential of bedfinder and other powerful geospatial modelling tools to be trained on well-rounded and robust data sets will be strengthened.