Marion McKenzie, Ellianna Abrahams, Fernando Pérez, Ryan Venturelli
{"title":"对“使用机器学习自动识别冰下流线型河床:一种开源的Python方法”评论的回应","authors":"Marion McKenzie, Ellianna Abrahams, Fernando Pérez, Ryan Venturelli","doi":"10.1111/bor.70004","DOIUrl":null,"url":null,"abstract":"<p>Li <i>et al</i>. (<span>2025</span> this issue) state that they identify areas for improvement to our development of bedfinder through including more data sets at training, evaluating our filtering methods, and exploring modular approaches of the tool. Here we respond to these Comments by highlighting where we have already addressed each of these areas within our work and notably, our supporting information (Abrahams <i>et al</i>. <span>2024</span>). In our paper, we describe that bedfinder is an inherently modular tool, allowing a user to choose which components of the pipeline might be useful to them. Furthermore, bedfinder already allows a user to customize the choice to over- or under-predict a glacially derived bedform assignment. In Abrahams <i>et al</i>. (<span>2024</span>) we previously justified why we made the choice to over-predict, emphasizing the need for manual post-processing, which we reiterate here. Finally, we reshare statements from the ‘Model limitations’ section of Abrahams <i>et al</i>. (<span>2024</span>) where we also recommend incorporating additional data in future tool creation to strengthen our approach.</p><p>The objective of Abrahams <i>et al</i>. (<span>2024</span>) was to build an open-source tool that would allow for the automatic detection of glacially derived streamlined subglacial bedforms, based on the previous successes of manual approaches (e.g. Clark <span>1993</span>; Greenwood & Clark <span>2008</span>; Spagnolo <i>et al</i>. <span>2014</span>; Ely <i>et al</i>. <span>2016</span>; Principato <i>et al</i>. <span>2016</span>; Clark <i>et al</i>. <span>2018</span>). To develop this tool, we used Random Forest (Breiman <span>2001</span>), XGBoost (Chen & Guestrin <span>2016</span>), and an ensemble average of these two model fits on a publicly available training data set of nearly 600 000 data points across the deglaciated Northern Hemisphere (McKenzie <i>et al</i>. <span>2022</span>). Li <i>et al</i>. (<span>2025</span>) suggest a constructive critique to our approach stating our work should include more data sets at training, further evaluate our filtering methods, and explore better modularizing bedfinder. However, these suggestions have either already been implemented as existing features of bedfinder (i.e. modularity of pipeline components and tunability of bedform predictions to the needs of the user) or we have already named them in our study as known limitations of the tool that will require wider community participation (the creation of larger, more nuanced machine learning data sets for model training).</p><p>Abrahams <i>et al</i>. (<span>2024</span>) outlined the limitations of the presented approach in the section ‘Model limitations’. There, we state that the ‘TPI tool used to compile our training set performs most poorly in regions with highly elongate bedforms with low surface relief (McKenzie <i>et al</i>. <span>2022</span>)’ (i.e. bedforms across crystalline bedrock surfaces) (Abrahams <i>et al</i>. <span>2024</span>), which Li <i>et al</i>. (<span>2025</span>) also note as a limitation in their Comments. Further comments in Li <i>et al</i>. (<span>2025</span>) reflect the performance and strengths of the TPI tool, which is not the focus of Abrahams <i>et al</i>. (<span>2024</span>) and is instead the focus of McKenzie <i>et al</i>. 
(<span>2022</span>). Even so, justification for the choice to make binary classifications of the data is explained within the ‘Model limitations’ section of Abrahams <i>et al</i>. (<span>2024</span>). The solution Li <i>et al</i>. (<span>2025</span>) present to overcome the training data set limitations is one that was explicitly outlined within the ‘Future directions for tool advancements’ section in Abrahams <i>et al</i>. (<span>2024</span>). Almost identically to what Li <i>et al</i>. (<span>2025</span>) present as their solution to this problem, we express within our original work that ‘future incorporation of additional training data that increase representation of low relief and multi-directional ice flow’ will help overcome these deficiencies in bedfinder (Abrahams <i>et al</i>. <span>2024</span>).</p><p>In their Comments, Li <i>et al</i>. (<span>2025</span>) claim that the performance metrics used in Abrahams <i>et al</i>. (<span>2024</span>) obfuscate true performance given the inherent class imbalance within the training data, despite the fact that this exact concern is the main focus of the Supporting Information Data S2. There we highlight our focus on the F1 score as a preferred metric for model selection in the case of class imbalance because its usage ‘prevents an overly optimistic assessment’ and is ‘balanced towards a low rate of false positives and false negatives’ (Abrahams <i>et al</i>. <span>2024</span>). We direct the interested reader to He and Garcia (<span>2009</span>) to read more about the usefulness of the F1 score in the case of class imbalance (Flach & Kull <span>2015</span> is another excellent resource).</p><p>We would like to emphasize, as should be implicitly understood, that the performance metrics in Table 2 of Abrahams <i>et al</i>. (<span>2024</span>) are only achievable on tests within the distribution of the training set. To investigate potential distribution shift, we validate bedfinder on an out-of-distribution (OOD) sample, as is detailed in the section titled ‘Model validation through a subset of Green Bay Lobe bedforms’, finding that bedfinder, as it is tuned within the paper, recovers >79% of the true positives there. Li <i>et al</i>. (<span>2025</span>) make the recommendation to withhold individual regions as test data, and we are sure that they will be happy to see that this analysis is already completed in that section of Abrahams <i>et al</i>. (<span>2024</span>). In the paper, we shared the accuracy achieved along with the ROC curves for the new region, and in our publicly available accompanying materials (Abrahams & McKenzie <span>2024</span>) we shared recall, precision, and F1 score. We take this opportunity to reiterate that ‘the choice of sites for the training data set limits the model's ability to extrapolate results to new regions with topographic constraints or bedrock types that are out-of-distribution (OOD). Any applications of the tools in this paper to OOD data are statistically unreliable but will still provide a starting point to analysing presence of streamlined subglacial bedforms across a deglaciated landscape’ (Abrahams <i>et al</i>. <span>2024</span>).</p><p>As stated, bedfinder is not intended as the only solution, but rather a starting point for scientific practitioners hoping to automate some of the process of describing landforms within a deglaciated landscape. In claiming that Abrahams <i>et al</i>. 
(<span>2024</span>) misrepresent the implications of tuning our model approach towards overestimating false positives (which leads to the inclusion of mistaken detections) instead of tuning towards overestimating false negatives (which leads to the exclusion of definite bedforms at the distribution's edge), Li <i>et al</i>. (<span>2025</span>) seem to suggest that users would indiscriminately apply this tool without critically evaluating its outcomes, <i>which would explicitly go against the usage recommendations outlined within our paper and documentation</i>. Not only does our paper make the model tuning towards over-prediction clear, we recommend the reader to run inferences from all three model fits and compare the results within the paper on any new data to aid in manual post-processing. Furthermore, in the publicly available tool documentation we indicate how our tool can be used by the reader to tune towards overestimating false negatives should they wish (Abrahams & McKenzie <span>2024</span>).</p><p>In a further claim, Li <i>et al</i>. (<span>2025</span>) state that Abrahams <i>et al</i>. (<span>2024</span>) have not ‘sufficiently explored’ the tradeoff between false positives and false negatives. We disagree, as this tradeoff is established to be well described by the F1 score, which is commonly used to assess performance in the presence of class imbalance (He & Garcia <span>2009</span>; Flach & Kull <span>2015</span>, among others), as described above and shared in the Supporting Information Data S2, where we state ‘for this reason, we primarily focus on F1 score throughout this work’. Furthermore, the ROC curves shown in Fig. 4A (Abrahams <i>et al</i>. <span>2024</span>) explore this argument. Li <i>et al</i>. (<span>2025</span>) also suggest that we should have utilized a precision-recall curve (precision vs. recall) instead of an ROC curve (true positive rate vs. false positive rate) in making our model selection; however, as is highlighted in Flach & Kull (<span>2015</span>) and elsewhere, precision-recall curves can be misleading in cases of class imbalance where, in contrast, it has been widely established that ROC curves are not sensitive to class ratio (Fawcett <span>2006</span>) and therefore a more stable approach. Recently, McDermott <i>et al</i>. (<span>2024</span>) demonstrated that the area under the precision-recall curve is an explicitly biased and discriminatory metric for model selection, and recommended relying on the area under an ROC curve in cases where minimizing false negatives is more important than minimizing false positives, which are the stated needs of our paper.</p><p>Li <i>et al</i>. (<span>2025</span>) disagree with our choice to bias our model fit towards false positives; however, in any ML model there is a tuning choice between biasing the model towards overprediction (more false positives) or underprediction (more false negatives) which are inherently in conflict in imbalanced data sets (Chawla <span>2005</span>). Since we know that post-processing is a viable alternative for any user, and indeed, we recommend this to any practitioner who implements our tool, we choose to tune towards overprediction in order not to miss any true positives. The model output needs to be manually assessed for accuracy, but starting from a data set with a fraction of false positives is preferable for glacial geomorphologists rather than revisiting the raw elevation data to identify missed true positives. If a user, like Li <i>et al</i>. 
(<span>2025</span>) with their preference towards underprediction, prefers another tuning however, bedfinder (Abrahams & McKenzie <span>2024</span>) can be implemented <i>as is</i> to select a stricter probability threshold in its prediction and therefore be tuned towards false positives. Figure 4B in Abrahams <i>et al</i>. (<span>2024</span>) illustrates the tradeoff between precision and recall as this probability threshold is tuned.</p><p>We are surprised to see that Li <i>et al</i>. (<span>2025</span>) take issue with our choice to filter the data as a preprocessing step in our pipeline, as resampling is a well-established method for preparing imbalanced data for use with ML (Chawla <span>2005</span>; He & Garcia <span>2009</span>, among others). He and Garcia (<span>2009</span>) in ‘Learning from imbalanced data’, recommend undersampling the majority class rather than oversampling to avoid overfitting, and broadly divide this approach into two categories: ‘random undersampling’ and ‘informed undersampling’. Not only did we preliminarily train and test the statistical model on the unfiltered data set (see links to the GitHub repository containing the accompanying analysis files in the Data Availability statement of our paper, Abrahams & McKenzie <span>2024</span>), we also trained and tested the statistical model on random undersampling as defined by He and Garcia (<span>2009</span>), as is shown in Fig. 3 of Abrahams <i>et al</i>. (<span>2024</span>). We found that the model was only able to recover 22% of the true positives without any filtering (Fig. 3 of Abrahams <i>et al</i>. <span>2024</span> shows that random undersampling only recovers 43% of the true positives). These model fits – trained on unfiltered data and on data where the non-glacially derived landforms were randomly undersampled – were biased towards underidentifying true glacially derived bedforms. In both cases, flipping a fair coin for class assignment would provide better recall than the model fits. (Of course, a coin flip would also provide worse precision, since in both of these cases the model fits are biased towards missing true glacially derived bedforms and would therefore assign negatives (both correctly and incorrectly) more often than a fair coin flip would. A coin flip approach would therefore not alleviate the need for manual assignment on the entire input data set any more than models trained on unfiltered or unusefully filtered data, but does illustrate the need for informed filtering in this case). As requested by Li <i>et al</i>. (<span>2025</span>) we reproduce the confusion matrices for these investigations here in Fig. 1, which illustrate how the majority of bedforms are missed while using unfiltered or unusefully filtered data.</p><p>He & Garcia (<span>2009</span>) offer several examples of informed undersampling algorithms, including Near Miss, all of which implement purely statistical approaches to filtering the data, often requiring advanced knowledge of class assignment. Ultimately, after testing several undersampling approaches (see the analysis files accompanying our paper), we settled on Near Miss as it provided the most generalizable results on a withheld test set. Li <i>et al</i>. 
(<span>2025</span>) state that our approach is missing these ablation studies; however, we tested a variety of filtering approaches, including data cleaning techniques, before finalizing the choice of Near Miss, and all of these ablation studies are available in our accompanying publicly available GitHub repository (Abrahams & McKenzie <span>2024</span>).</p><p>As mentioned above, Near Miss and other informed undersampling algorithms (e.g. He & Garcia <span>2009</span>) rely on statistical filtering approaches to create greater distance between classes within the data. However, when a scientific expert manually labels bedforms, often with a prefilter approach to remove obviously spurious detections, the technique informing this filter is scientifically motivated rather than statistically motivated. We believe that Abrahams <i>et al</i>. (<span>2024</span>) speaks for itself in the power of combining these two approaches. Li <i>et al</i>. (<span>2025</span>) claim that the choices that led to our filtering approach are not transparent, but we would like to remind the reader that the opposite is true: our filtering approach is clearly outlined and probed in the ‘Filtering the non-glacial features to balance classes’ section of the paper and we make these filtering options easily available to the reader as the <i>filtering</i> function in bedfinder, and in the accompanying GitHub repo (Abrahams & McKenzie <span>2024</span>). These improvements in transparency of decision making are a large improvement to the ‘[…] manual classification choices [that vary] from expert to expert [that] can lead to difficult to reproduce, subjective classification schema with associated unquantified error’ (Abrahams <i>et al</i>. <span>2024</span>). We would also like to emphasize that Near Miss <i>is itself</i> a ‘data-driven filtering technique’. In driving an approach that combines statistical filtering with scientific filtering, we have already created an implementation that allows the model to adjust to new data without overfitting. The power of this combined approach allows scientific information to guide the model in a way that can still be generalized to new regions.</p><p>We agree with the astute observation from Li <i>et al</i>. (<span>2025</span>), raised first in Abrahams <i>et al</i>. (<span>2024</span>), that addressing the need for expanded publicly available training sets will refine the model's ability to perform in OOD regions. To showcase our initial argument in support of open science and reproducibility, in writing this response, the authors were able to address all of Li <i>et al</i>.'s (<span>2025</span>) concerns without the need to conduct any new analyses and by providing references to the already existing, fully available data and results from their initial study (Abrahams & McKenzie <span>2024</span>). We thank the authors of the Comments for their support in helping us illustrate this point. 
By incorporating existing glacial geomorphology training data into the findable, accessible, interoperable, and reproducible (FAIR; Wilkinson <span>2016</span>) data framework, the potential of bedfinder and other powerful geospatial modelling tools to be trained on well-rounded and robust data sets will be strengthened.</p>","PeriodicalId":9184,"journal":{"name":"Boreas","volume":"54 2","pages":"277-280"},"PeriodicalIF":2.4000,"publicationDate":"2025-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/bor.70004","citationCount":"0","resultStr":"{\"title\":\"Response to Comment on ‘Automatic identification of streamlined subglacial bedforms using machine learning: an open-source Python approach’\",\"authors\":\"Marion McKenzie, Ellianna Abrahams, Fernando Pérez, Ryan Venturelli\",\"doi\":\"10.1111/bor.70004\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Li <i>et al</i>. (<span>2025</span> this issue) state that they identify areas for improvement to our development of bedfinder through including more data sets at training, evaluating our filtering methods, and exploring modular approaches of the tool. Here we respond to these Comments by highlighting where we have already addressed each of these areas within our work and notably, our supporting information (Abrahams <i>et al</i>. <span>2024</span>). In our paper, we describe that bedfinder is an inherently modular tool, allowing a user to choose which components of the pipeline might be useful to them. Furthermore, bedfinder already allows a user to customize the choice to over- or under-predict a glacially derived bedform assignment. In Abrahams <i>et al</i>. (<span>2024</span>) we previously justified why we made the choice to over-predict, emphasizing the need for manual post-processing, which we reiterate here. Finally, we reshare statements from the ‘Model limitations’ section of Abrahams <i>et al</i>. (<span>2024</span>) where we also recommend incorporating additional data in future tool creation to strengthen our approach.</p><p>The objective of Abrahams <i>et al</i>. (<span>2024</span>) was to build an open-source tool that would allow for the automatic detection of glacially derived streamlined subglacial bedforms, based on the previous successes of manual approaches (e.g. Clark <span>1993</span>; Greenwood & Clark <span>2008</span>; Spagnolo <i>et al</i>. <span>2014</span>; Ely <i>et al</i>. <span>2016</span>; Principato <i>et al</i>. <span>2016</span>; Clark <i>et al</i>. <span>2018</span>). To develop this tool, we used Random Forest (Breiman <span>2001</span>), XGBoost (Chen & Guestrin <span>2016</span>), and an ensemble average of these two model fits on a publicly available training data set of nearly 600 000 data points across the deglaciated Northern Hemisphere (McKenzie <i>et al</i>. <span>2022</span>). Li <i>et al</i>. (<span>2025</span>) suggest a constructive critique to our approach stating our work should include more data sets at training, further evaluate our filtering methods, and explore better modularizing bedfinder. However, these suggestions have either already been implemented as existing features of bedfinder (i.e. 
modularity of pipeline components and tunability of bedform predictions to the needs of the user) or we have already named them in our study as known limitations of the tool that will require wider community participation (the creation of larger, more nuanced machine learning data sets for model training).</p><p>Abrahams <i>et al</i>. (<span>2024</span>) outlined the limitations of the presented approach in the section ‘Model limitations’. There, we state that the ‘TPI tool used to compile our training set performs most poorly in regions with highly elongate bedforms with low surface relief (McKenzie <i>et al</i>. <span>2022</span>)’ (i.e. bedforms across crystalline bedrock surfaces) (Abrahams <i>et al</i>. <span>2024</span>), which Li <i>et al</i>. (<span>2025</span>) also note as a limitation in their Comments. Further comments in Li <i>et al</i>. (<span>2025</span>) reflect the performance and strengths of the TPI tool, which is not the focus of Abrahams <i>et al</i>. (<span>2024</span>) and is instead the focus of McKenzie <i>et al</i>. (<span>2022</span>). Even so, justification for the choice to make binary classifications of the data is explained within the ‘Model limitations’ section of Abrahams <i>et al</i>. (<span>2024</span>). The solution Li <i>et al</i>. (<span>2025</span>) present to overcome the training data set limitations is one that was explicitly outlined within the ‘Future directions for tool advancements’ section in Abrahams <i>et al</i>. (<span>2024</span>). Almost identically to what Li <i>et al</i>. (<span>2025</span>) present as their solution to this problem, we express within our original work that ‘future incorporation of additional training data that increase representation of low relief and multi-directional ice flow’ will help overcome these deficiencies in bedfinder (Abrahams <i>et al</i>. <span>2024</span>).</p><p>In their Comments, Li <i>et al</i>. (<span>2025</span>) claim that the performance metrics used in Abrahams <i>et al</i>. (<span>2024</span>) obfuscate true performance given the inherent class imbalance within the training data, despite the fact that this exact concern is the main focus of the Supporting Information Data S2. There we highlight our focus on the F1 score as a preferred metric for model selection in the case of class imbalance because its usage ‘prevents an overly optimistic assessment’ and is ‘balanced towards a low rate of false positives and false negatives’ (Abrahams <i>et al</i>. <span>2024</span>). We direct the interested reader to He and Garcia (<span>2009</span>) to read more about the usefulness of the F1 score in the case of class imbalance (Flach & Kull <span>2015</span> is another excellent resource).</p><p>We would like to emphasize, as should be implicitly understood, that the performance metrics in Table 2 of Abrahams <i>et al</i>. (<span>2024</span>) are only achievable on tests within the distribution of the training set. To investigate potential distribution shift, we validate bedfinder on an out-of-distribution (OOD) sample, as is detailed in the section titled ‘Model validation through a subset of Green Bay Lobe bedforms’, finding that bedfinder, as it is tuned within the paper, recovers >79% of the true positives there. Li <i>et al</i>. (<span>2025</span>) make the recommendation to withhold individual regions as test data, and we are sure that they will be happy to see that this analysis is already completed in that section of Abrahams <i>et al</i>. (<span>2024</span>). 
In the paper, we shared the accuracy achieved along with the ROC curves for the new region, and in our publicly available accompanying materials (Abrahams & McKenzie <span>2024</span>) we shared recall, precision, and F1 score. We take this opportunity to reiterate that ‘the choice of sites for the training data set limits the model's ability to extrapolate results to new regions with topographic constraints or bedrock types that are out-of-distribution (OOD). Any applications of the tools in this paper to OOD data are statistically unreliable but will still provide a starting point to analysing presence of streamlined subglacial bedforms across a deglaciated landscape’ (Abrahams <i>et al</i>. <span>2024</span>).</p><p>As stated, bedfinder is not intended as the only solution, but rather a starting point for scientific practitioners hoping to automate some of the process of describing landforms within a deglaciated landscape. In claiming that Abrahams <i>et al</i>. (<span>2024</span>) misrepresent the implications of tuning our model approach towards overestimating false positives (which leads to the inclusion of mistaken detections) instead of tuning towards overestimating false negatives (which leads to the exclusion of definite bedforms at the distribution's edge), Li <i>et al</i>. (<span>2025</span>) seem to suggest that users would indiscriminately apply this tool without critically evaluating its outcomes, <i>which would explicitly go against the usage recommendations outlined within our paper and documentation</i>. Not only does our paper make the model tuning towards over-prediction clear, we recommend the reader to run inferences from all three model fits and compare the results within the paper on any new data to aid in manual post-processing. Furthermore, in the publicly available tool documentation we indicate how our tool can be used by the reader to tune towards overestimating false negatives should they wish (Abrahams & McKenzie <span>2024</span>).</p><p>In a further claim, Li <i>et al</i>. (<span>2025</span>) state that Abrahams <i>et al</i>. (<span>2024</span>) have not ‘sufficiently explored’ the tradeoff between false positives and false negatives. We disagree, as this tradeoff is established to be well described by the F1 score, which is commonly used to assess performance in the presence of class imbalance (He & Garcia <span>2009</span>; Flach & Kull <span>2015</span>, among others), as described above and shared in the Supporting Information Data S2, where we state ‘for this reason, we primarily focus on F1 score throughout this work’. Furthermore, the ROC curves shown in Fig. 4A (Abrahams <i>et al</i>. <span>2024</span>) explore this argument. Li <i>et al</i>. (<span>2025</span>) also suggest that we should have utilized a precision-recall curve (precision vs. recall) instead of an ROC curve (true positive rate vs. false positive rate) in making our model selection; however, as is highlighted in Flach & Kull (<span>2015</span>) and elsewhere, precision-recall curves can be misleading in cases of class imbalance where, in contrast, it has been widely established that ROC curves are not sensitive to class ratio (Fawcett <span>2006</span>) and therefore a more stable approach. Recently, McDermott <i>et al</i>. 
(<span>2024</span>) demonstrated that the area under the precision-recall curve is an explicitly biased and discriminatory metric for model selection, and recommended relying on the area under an ROC curve in cases where minimizing false negatives is more important than minimizing false positives, which are the stated needs of our paper.</p><p>Li <i>et al</i>. (<span>2025</span>) disagree with our choice to bias our model fit towards false positives; however, in any ML model there is a tuning choice between biasing the model towards overprediction (more false positives) or underprediction (more false negatives) which are inherently in conflict in imbalanced data sets (Chawla <span>2005</span>). Since we know that post-processing is a viable alternative for any user, and indeed, we recommend this to any practitioner who implements our tool, we choose to tune towards overprediction in order not to miss any true positives. The model output needs to be manually assessed for accuracy, but starting from a data set with a fraction of false positives is preferable for glacial geomorphologists rather than revisiting the raw elevation data to identify missed true positives. If a user, like Li <i>et al</i>. (<span>2025</span>) with their preference towards underprediction, prefers another tuning however, bedfinder (Abrahams & McKenzie <span>2024</span>) can be implemented <i>as is</i> to select a stricter probability threshold in its prediction and therefore be tuned towards false positives. Figure 4B in Abrahams <i>et al</i>. (<span>2024</span>) illustrates the tradeoff between precision and recall as this probability threshold is tuned.</p><p>We are surprised to see that Li <i>et al</i>. (<span>2025</span>) take issue with our choice to filter the data as a preprocessing step in our pipeline, as resampling is a well-established method for preparing imbalanced data for use with ML (Chawla <span>2005</span>; He & Garcia <span>2009</span>, among others). He and Garcia (<span>2009</span>) in ‘Learning from imbalanced data’, recommend undersampling the majority class rather than oversampling to avoid overfitting, and broadly divide this approach into two categories: ‘random undersampling’ and ‘informed undersampling’. Not only did we preliminarily train and test the statistical model on the unfiltered data set (see links to the GitHub repository containing the accompanying analysis files in the Data Availability statement of our paper, Abrahams & McKenzie <span>2024</span>), we also trained and tested the statistical model on random undersampling as defined by He and Garcia (<span>2009</span>), as is shown in Fig. 3 of Abrahams <i>et al</i>. (<span>2024</span>). We found that the model was only able to recover 22% of the true positives without any filtering (Fig. 3 of Abrahams <i>et al</i>. <span>2024</span> shows that random undersampling only recovers 43% of the true positives). These model fits – trained on unfiltered data and on data where the non-glacially derived landforms were randomly undersampled – were biased towards underidentifying true glacially derived bedforms. In both cases, flipping a fair coin for class assignment would provide better recall than the model fits. (Of course, a coin flip would also provide worse precision, since in both of these cases the model fits are biased towards missing true glacially derived bedforms and would therefore assign negatives (both correctly and incorrectly) more often than a fair coin flip would. 
A coin flip approach would therefore not alleviate the need for manual assignment on the entire input data set any more than models trained on unfiltered or unusefully filtered data, but does illustrate the need for informed filtering in this case). As requested by Li <i>et al</i>. (<span>2025</span>) we reproduce the confusion matrices for these investigations here in Fig. 1, which illustrate how the majority of bedforms are missed while using unfiltered or unusefully filtered data.</p><p>He & Garcia (<span>2009</span>) offer several examples of informed undersampling algorithms, including Near Miss, all of which implement purely statistical approaches to filtering the data, often requiring advanced knowledge of class assignment. Ultimately, after testing several undersampling approaches (see the analysis files accompanying our paper), we settled on Near Miss as it provided the most generalizable results on a withheld test set. Li <i>et al</i>. (<span>2025</span>) state that our approach is missing these ablation studies; however, we tested a variety of filtering approaches, including data cleaning techniques, before finalizing the choice of Near Miss, and all of these ablation studies are available in our accompanying publicly available GitHub repository (Abrahams & McKenzie <span>2024</span>).</p><p>As mentioned above, Near Miss and other informed undersampling algorithms (e.g. He & Garcia <span>2009</span>) rely on statistical filtering approaches to create greater distance between classes within the data. However, when a scientific expert manually labels bedforms, often with a prefilter approach to remove obviously spurious detections, the technique informing this filter is scientifically motivated rather than statistically motivated. We believe that Abrahams <i>et al</i>. (<span>2024</span>) speaks for itself in the power of combining these two approaches. Li <i>et al</i>. (<span>2025</span>) claim that the choices that led to our filtering approach are not transparent, but we would like to remind the reader that the opposite is true: our filtering approach is clearly outlined and probed in the ‘Filtering the non-glacial features to balance classes’ section of the paper and we make these filtering options easily available to the reader as the <i>filtering</i> function in bedfinder, and in the accompanying GitHub repo (Abrahams & McKenzie <span>2024</span>). These improvements in transparency of decision making are a large improvement to the ‘[…] manual classification choices [that vary] from expert to expert [that] can lead to difficult to reproduce, subjective classification schema with associated unquantified error’ (Abrahams <i>et al</i>. <span>2024</span>). We would also like to emphasize that Near Miss <i>is itself</i> a ‘data-driven filtering technique’. In driving an approach that combines statistical filtering with scientific filtering, we have already created an implementation that allows the model to adjust to new data without overfitting. The power of this combined approach allows scientific information to guide the model in a way that can still be generalized to new regions.</p><p>We agree with the astute observation from Li <i>et al</i>. (<span>2025</span>), raised first in Abrahams <i>et al</i>. (<span>2024</span>), that addressing the need for expanded publicly available training sets will refine the model's ability to perform in OOD regions. 
To showcase our initial argument in support of open science and reproducibility, in writing this response, the authors were able to address all of Li <i>et al</i>.'s (<span>2025</span>) concerns without the need to conduct any new analyses and by providing references to the already existing, fully available data and results from their initial study (Abrahams & McKenzie <span>2024</span>). We thank the authors of the Comments for their support in helping us illustrate this point. By incorporating existing glacial geomorphology training data into the findable, accessible, interoperable, and reproducible (FAIR; Wilkinson <span>2016</span>) data framework, the potential of bedfinder and other powerful geospatial modelling tools to be trained on well-rounded and robust data sets will be strengthened.</p>\",\"PeriodicalId\":9184,\"journal\":{\"name\":\"Boreas\",\"volume\":\"54 2\",\"pages\":\"277-280\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-03-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1111/bor.70004\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Boreas\",\"FirstCategoryId\":\"89\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/bor.70004\",\"RegionNum\":3,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"GEOGRAPHY, PHYSICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Boreas","FirstCategoryId":"89","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bor.70004","RegionNum":3,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"GEOGRAPHY, PHYSICAL","Score":null,"Total":0}
引用次数: 0
摘要
Li等人(2025年本刊)指出,他们通过在训练中纳入更多数据集,评估我们的过滤方法,以及探索工具的模块化方法,确定了我们在床寻床器开发中需要改进的领域。在这里,我们通过强调我们在工作中已经解决的每个领域,特别是我们的支持信息来回应这些评论(亚伯拉罕等人,2024)。在我们的论文中,我们描述了寻床器是一个固有的模块化工具,允许用户选择哪些管道组件可能对他们有用。此外,寻床器已经允许用户自定义选择,以高估或低估冰川衍生的床型分配。在亚伯拉罕等人(2024)中,我们之前证明了为什么我们选择过度预测,强调需要手动后处理,我们在这里重申。最后,我们重新分享了Abrahams等人(2024)的“模型限制”部分的陈述,我们还建议在未来的工具创建中加入额外的数据,以加强我们的方法。亚伯拉罕等人(2024)的目标是建立一个开源工具,该工具将允许自动检测冰川衍生的流线型冰下河床,基于先前的成功的人工方法(例如Clark 1993;格林伍德,克拉克2008年;Spagnolo et al. 2014;Ely et al. 2016;Principato et al. 2016;Clark et al. 2018)。为了开发这个工具,我们使用了Random Forest (Breiman 2001), XGBoost (Chen &;Guestrin 2016),这两个模型的集合平均值可以在一个公开可用的训练数据集上进行拟合,该数据集包含北半球冰川消融的近60万个数据点(McKenzie et al. 2022)。Li等人(2025)对我们的方法提出了建设性的批评,指出我们的工作应该包括更多的训练数据集,进一步评估我们的过滤方法,并探索更好的模块化寻床器。然而,这些建议要么已经作为bedfinder的现有功能(即管道组件的模块化和对用户需求的床型预测的可调性)实现,要么我们已经在我们的研究中将它们命名为需要更广泛社区参与的工具的已知限制(创建更大,更细致的机器学习数据集用于模型训练)。亚伯拉罕等人(2024)在“模型限制”一节中概述了所提出方法的局限性。在这里,我们声明“用于编译我们的训练集的TPI工具在具有低表面起伏的高度细长的床型区域(McKenzie等人,2022)(即跨越结晶基岩表面的床型)(Abrahams等人,2024)中表现最差”,Li等人(2025)也在他们的评论中指出了这一点。Li等人(2025)的进一步评论反映了TPI工具的性能和优势,这不是Abrahams等人(2024)的重点,而是McKenzie等人(2022)的重点。即便如此,选择对数据进行二元分类的理由在亚伯拉罕等人(2024)的“模型限制”部分得到了解释。Li等人(2025)提出的克服训练数据集限制的解决方案在Abrahams等人(2024)的“工具进步的未来方向”一节中明确概述。与Li et al.(2025)提出的解决方案几乎相同,我们在原始工作中表示,“未来纳入额外的训练数据,增加低地形和多向冰流的代表性”将有助于克服bedfinder的这些缺陷(Abrahams et al. 2024)。在他们的评论中,Li等人(2025)声称,鉴于训练数据中固有的类不平衡,Abrahams等人(2024)使用的性能指标混淆了真实的性能,尽管事实上这种关注正是支持信息数据S2的主要焦点。在那里,我们强调了我们对F1分数的关注,将其作为类别不平衡情况下模型选择的首选指标,因为它的使用“防止了过度乐观的评估”,并且“平衡了低假阳性和假阴性率”(Abrahams等人,2024)。我们建议感兴趣的读者阅读He和Garcia(2009),以了解更多关于F1分数在阶级失衡情况下的有用性(Flach &;库尔2015是另一个很好的资源)。我们想强调的是,应该含蓄地理解,Abrahams等人(2024)的表2中的性能指标只能在训练集分布内的测试中实现。为了研究潜在的分布转移,我们在分布外(OOD)样本上验证了寻床器,详见“通过Green Bay Lobe地层子集进行模型验证”一节,发现寻床器在论文中进行调整后,在那里恢复了79%的真阳性。Li等人。 (2025)建议保留个别地区作为测试数据,我们确信他们会很高兴看到亚伯拉罕等人(2024)的那一部分已经完成了这一分析。在本文中,我们分享了新区域的ROC曲线和我们公开提供的随附材料(abraham &;麦肯齐2024)我们分享了召回率、准确率和F1分数。我们借此机会重申,训练数据集的地点选择限制了模型将结果外推到具有地形约束或基岩类型不在分布范围(OOD)的新区域的能力。本文中工具对OOD数据的任何应用在统计上都是不可靠的,但仍将为分析冰川消融景观中流线型冰下河床的存在提供一个起点”(Abrahams et al. 2024)。如上所述,寻床仪并不是唯一的解决方案,而是科学实践者希望在冰川消退的景观中自动化描述地形的一些过程的起点。在声称Abrahams等人(2024)歪曲了将我们的模型方法调整为高估假阳性(导致包含错误检测)而不是调整为高估假阴性(导致排除分布边缘的确定形态)的含义时,Li等人(2025)似乎暗示用户会不加选择地应用此工具而不仔细评估其结果。这显然违背了我们的论文和文档中概述的使用建议。我们的论文不仅明确了模型对过度预测的调整,我们还建议读者从所有三个模型拟合中进行推断,并在任何新数据上比较论文中的结果,以帮助手动后处理。此外,在公开可用的工具文档中,我们指出了读者如何使用我们的工具来调整他们希望高估的假阴性(亚伯拉罕和;麦肯齐2024)。在进一步的声明中,Li等人(2025)指出,亚伯拉罕等人(2024)没有“充分探索”假阳性和假阴性之间的权衡。我们不同意这种观点,因为F1分数可以很好地描述这种权衡,F1分数通常用于评估班级不平衡情况下的表现(He &;加西亚2009;Flach,如上所述,并在支持信息数据S2中共享,其中我们声明“出于这个原因,我们在整个工作中主要关注F1分数”。此外,图4A所示的ROC曲线(Abrahams et al. 
2024)探讨了这一论点。Li等人(2025)还建议,在进行模型选择时,我们应该使用精度-召回率曲线(精度vs召回率)而不是ROC曲线(真阳性率vs假阳性率);然而,正如Flach &;Kull(2015)和其他地方,在类别不平衡的情况下,精确召回率曲线可能会产生误导,相反,已经广泛建立的ROC曲线对类别比例不敏感(Fawcett 2006),因此是一种更稳定的方法。最近,McDermott等人(2024)证明,精确召回率曲线下的面积是模型选择的一个明确的偏见和歧视指标,并建议在最小化假阴性比最小化假阳性更重要的情况下依赖ROC曲线下的面积,这是我们论文所陈述的需求。Li等人(2025)不同意我们将模型拟合偏向假阳性的选择;然而,在任何ML模型中,都有一个调整选择,将模型偏向于过度预测(更多假阳性)或低估(更多假阴性),这在不平衡数据集中是固有的冲突(Chawla 2005)。因为我们知道后处理对任何用户来说都是一个可行的选择,事实上,我们向任何实现我们工具的从业者推荐这一点,为了不错过任何真正的积极因素,我们选择调整到过度预测。模型输出的准确性需要人工评估,但对于冰川地貌学家来说,从带有少量假阳性的数据集开始比重新访问原始海拔数据来识别遗漏的真阳性要好。如果一个用户,像Li等人(2025)一样,倾向于预测不足,那么他们更喜欢另一种调谐,即寻床者(亚伯拉罕&;McKenzie 2024)可以这样实现,即在其预测中选择更严格的概率阈值,从而调整为误报。Abrahams等人(2024)的图4B说明了在调整概率阈值时精度和召回率之间的权衡。我们很惊讶地看到Li等人。 (2025)对我们选择过滤数据作为我们管道中的预处理步骤提出质疑,因为重新采样是一种成熟的方法,用于准备用于ML的不平衡数据(Chawla 2005;他,Garcia, 2009)。他和Garcia(2009)在“从不平衡数据中学习”中建议对大多数类进行欠采样,而不是过采样,以避免过拟合,并将这种方法大致分为两类:“随机欠采样”和“知情欠采样”。我们不仅在未过滤的数据集上对统计模型进行了初步训练和测试(见本文数据可用性声明中包含附带分析文件的GitHub存储库链接,Abrahams &;McKenzie 2024),我们还对He和Garcia(2009)定义的随机欠抽样统计模型进行了训练和测试,如图abraham et al.(2024)的图3所示。我们发现,在没有任何滤波的情况下,该模型只能恢复22%的真阳性(Abrahams et al. 2024的图3显示,随机欠采样只能恢复43%的真阳性)。这些模型拟合是在未经过滤的数据和非冰川衍生地形随机采样不足的数据上进行训练的,它们倾向于低估真正的冰川衍生地形。在这两种情况下,抛硬币进行课堂作业将提供比模型拟合更好的召回率。(当然,抛硬币也会提供更差的精度,因为在这两种情况下,模型拟合都倾向于错过真正的冰川衍生的河床,因此会比抛硬币更经常地给出否定的结果(正确和不正确)。因此,抛硬币的方法不会减轻对整个输入数据集的手动分配的需要,而不是在未经过滤或无用过滤的数据上训练的模型,但在这种情况下确实说明了知情过滤的需要)。根据Li等人(2025)的要求,我们在图1中再现了这些调查的混淆矩阵,这说明了在使用未过滤或无用过滤的数据时,大多数形式是如何被遗漏的。他,Garcia(2009)提供了几个知情欠采样算法的例子,包括Near Miss,所有这些算法都采用纯粹的统计方法来过滤数据,通常需要对班级分配有深入的了解。最终,在测试了几种欠采样方法(参见本文附带的分析文件)之后,我们决定使用Near Miss,因为它在保留的测试集上提供了最一般化的结果。Li等人(2025)指出,我们的方法缺少这些消融研究;然而,在最终选择Near Miss之前,我们测试了各种过滤方法,包括数据清理技术,所有这些消融研究都可以在我们随附的公开GitHub存储库中获得(abraham &;麦肯齐2024)。如上所述,Near Miss和其他知情欠采样算法(例如He &;Garcia 2009)依靠统计过滤方法在数据中创建更大的类之间的距离。然而,当科学专家手动标记床型时,通常使用预过滤方法来去除明显的虚假检测,通知该过滤器的技术是科学动机而不是统计动机。我们相信,亚伯拉罕等人(2024)在结合这两种方法的力量中不言自明。Li等人(2025)声称,导致我们的过滤方法的选择是不透明的,但我们想提醒读者,事实恰恰相反:我们的过滤方法在论文的“过滤非冰川特征以平衡类”部分中得到了清晰的概述和探讨,我们使这些过滤选项作为bedfinder中的过滤功能很容易提供给读者,并在附带的GitHub repo(亚伯拉罕&;麦肯齐2024)。决策透明度方面的这些改进是对“[…]专家之间的人工分类选择[不同]可能导致难以重现的主观分类模式和相关的未量化错误”的巨大改进(Abrahams et al. 2024)。我们还想强调的是,Near Miss本身就是一种“数据驱动的过滤技术”。在推动将统计过滤与科学过滤相结合的方法时,我们已经创建了一个实现,允许模型在不过度拟合的情况下适应新数据。这种综合方法的力量允许科学信息以一种仍然可以推广到新区域的方式指导模型。我们同意Li等人(2025)的敏锐观察,该观察首先在Abrahams等人(2024)中提出,即解决对扩展公开可用训练集的需求将改进模型在OOD区域的执行能力。 为了展示我们最初支持开放科学和可重复性的论点,在撰写此回复时,作者能够解决Li等人(2025)的所有问题,而无需进行任何新的分析,并且通过提供参考已经存在的,完全可用的数据和他们最初研究的结果(Abrahams &;麦肯齐2024)。我们感谢评论作者的支持,他们帮助我们说明了这一点。通过将现有的冰川地貌训练数据纳入可查找、可获取、可互操作和可复制的(FAIR;Wilkinson(2016)数据框架、取景器和其他强大的地理空间建模工具的潜力将得到加强,这些工具将在全面和强大的数据集上进行培训。
Response to Comment on ‘Automatic identification of streamlined subglacial bedforms using machine learning: an open-source Python approach’
Li et al. (2025 this issue) identify what they consider areas for improvement in our development of bedfinder: including more data sets during training, evaluating our filtering methods, and exploring modular approaches to the tool. Here we respond to these Comments by highlighting where we have already addressed each of these areas within our work and, notably, within our Supporting Information (Abrahams et al. 2024). In our paper, we describe bedfinder as an inherently modular tool, allowing a user to choose which components of the pipeline might be useful to them. Furthermore, bedfinder already allows a user to choose whether to over- or under-predict glacially derived bedform assignments. In Abrahams et al. (2024) we justified why we chose to over-predict, emphasizing the need for manual post-processing, and we reiterate that reasoning here. Finally, we reshare statements from the ‘Model limitations’ section of Abrahams et al. (2024), where we also recommend incorporating additional data in future tool development to strengthen our approach.
The objective of Abrahams et al. (2024) was to build an open-source tool that would allow for the automatic detection of glacially derived streamlined subglacial bedforms, based on the previous successes of manual approaches (e.g. Clark 1993; Greenwood & Clark 2008; Spagnolo et al. 2014; Ely et al. 2016; Principato et al. 2016; Clark et al. 2018). To develop this tool, we used Random Forest (Breiman 2001), XGBoost (Chen & Guestrin 2016), and an ensemble average of these two model fits on a publicly available training data set of nearly 600 000 data points across the deglaciated Northern Hemisphere (McKenzie et al. 2022). Li et al. (2025) offer a constructive critique of our approach, stating that our work should include more data sets during training, further evaluate our filtering methods, and explore better ways of modularizing bedfinder. However, these suggestions have either already been implemented as existing features of bedfinder (i.e. modularity of pipeline components and tunability of bedform predictions to the needs of the user) or we have already named them in our study as known limitations of the tool that will require wider community participation (the creation of larger, more nuanced machine learning data sets for model training).
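For readers less familiar with this modelling setup, the sketch below shows the general pattern of averaging the class probabilities of a Random Forest fit and an XGBoost fit. It is a minimal illustration assuming scikit-learn and xgboost, with placeholder data, and is not the bedfinder training set, hyperparameters, or pipeline.

```python
# Minimal sketch of probability averaging across a Random Forest and an XGBoost fit.
# Placeholder data only; not the bedfinder training set or configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = np.random.rand(1000, 8), np.random.randint(0, 2, 1000)   # stand-in features/labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
xgb = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X_train, y_train)

# Ensemble average: mean of the two positive-class probabilities, thresholded at 0.5.
p_ensemble = (rf.predict_proba(X_test)[:, 1] + xgb.predict_proba(X_test)[:, 1]) / 2
y_pred = (p_ensemble >= 0.5).astype(int)
```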
Abrahams et al. (2024) outlined the limitations of the presented approach in the section ‘Model limitations’. There, we state that the ‘TPI tool used to compile our training set performs most poorly in regions with highly elongate bedforms with low surface relief (McKenzie et al. 2022)’ (i.e. bedforms across crystalline bedrock surfaces) (Abrahams et al. 2024), which Li et al. (2025) also note as a limitation in their Comments. Further comments in Li et al. (2025) reflect the performance and strengths of the TPI tool, which is not the focus of Abrahams et al. (2024) and is instead the focus of McKenzie et al. (2022). Even so, justification for the choice to make binary classifications of the data is explained within the ‘Model limitations’ section of Abrahams et al. (2024). The solution Li et al. (2025) present to overcome the training data set limitations is one that was explicitly outlined within the ‘Future directions for tool advancements’ section in Abrahams et al. (2024). Almost identically to what Li et al. (2025) present as their solution to this problem, we express within our original work that ‘future incorporation of additional training data that increase representation of low relief and multi-directional ice flow’ will help overcome these deficiencies in bedfinder (Abrahams et al. 2024).
In their Comments, Li et al. (2025) claim that the performance metrics used in Abrahams et al. (2024) obfuscate true performance given the inherent class imbalance within the training data, despite the fact that this exact concern is the main focus of the Supporting Information Data S2. There we highlight our focus on the F1 score as a preferred metric for model selection in the case of class imbalance because its usage ‘prevents an overly optimistic assessment’ and is ‘balanced towards a low rate of false positives and false negatives’ (Abrahams et al. 2024). We direct the interested reader to He and Garcia (2009) to read more about the usefulness of the F1 score in the case of class imbalance (Flach & Kull 2015 is another excellent resource).
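To make concrete why we privilege the F1 score under class imbalance, the toy example below (synthetic labels, not our data) shows how a degenerate classifier that never predicts the minority class still attains high accuracy while its F1 score collapses to zero.

```python
# Toy illustration of why accuracy is overly optimistic under class imbalance.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 950 + [1] * 50)   # 5% positive (minority) class
y_pred = np.zeros_like(y_true)            # classifier that never predicts a positive

print(accuracy_score(y_true, y_pred))     # 0.95 - looks strong despite missing every positive
print(f1_score(y_true, y_pred))           # 0.0  - exposes the complete failure on the minority class
```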
We would like to emphasize, as should be implicitly understood, that the performance metrics in Table 2 of Abrahams et al. (2024) are only achievable on tests within the distribution of the training set. To investigate potential distribution shift, we validate bedfinder on an out-of-distribution (OOD) sample, as is detailed in the section titled ‘Model validation through a subset of Green Bay Lobe bedforms’, finding that bedfinder, as it is tuned within the paper, recovers >79% of the true positives there. Li et al. (2025) make the recommendation to withhold individual regions as test data, and we are sure that they will be happy to see that this analysis is already completed in that section of Abrahams et al. (2024). In the paper, we shared the accuracy achieved along with the ROC curves for the new region, and in our publicly available accompanying materials (Abrahams & McKenzie 2024) we shared recall, precision, and F1 score. We take this opportunity to reiterate that ‘the choice of sites for the training data set limits the model's ability to extrapolate results to new regions with topographic constraints or bedrock types that are out-of-distribution (OOD). Any applications of the tools in this paper to OOD data are statistically unreliable but will still provide a starting point to analysing presence of streamlined subglacial bedforms across a deglaciated landscape’ (Abrahams et al. 2024).
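For readers who wish to repeat a regional hold-out of this kind on their own data, the sketch below illustrates the idea of withholding one region during training and reporting recall on it. The DataFrame, its column names ('region', 'label'), and the region names are stand-ins for illustration only and do not reflect the bedfinder data schema.

```python
# Hedged sketch of a regional hold-out: train on every region except one, report recall there.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=2000),
    "x2": rng.normal(size=2000),
    "region": rng.choice(["region_a", "region_b", "green_bay_lobe"], size=2000),
    "label": rng.integers(0, 2, size=2000),
})

train = df[df["region"] != "green_bay_lobe"]      # train on all other regions
holdout = df[df["region"] == "green_bay_lobe"]    # withheld region acts as the OOD test
features = ["x1", "x2"]

model = RandomForestClassifier(random_state=0).fit(train[features], train["label"])

# Recall on the withheld region indicates how many true bedforms are recovered out-of-distribution.
print(recall_score(holdout["label"], model.predict(holdout[features])))
```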
As stated, bedfinder is not intended as the only solution, but rather as a starting point for scientific practitioners hoping to automate some of the process of describing landforms within a deglaciated landscape. In claiming that Abrahams et al. (2024) misrepresent the implications of tuning our model approach towards overestimating false positives (which leads to the inclusion of mistaken detections) instead of tuning towards overestimating false negatives (which leads to the exclusion of definite bedforms at the distribution's edge), Li et al. (2025) seem to suggest that users would indiscriminately apply this tool without critically evaluating its outcomes, which would explicitly go against the usage recommendations outlined within our paper and documentation. Not only does our paper make clear that the model is tuned towards over-prediction, we also recommend that the reader run inference with all three model fits presented in the paper on any new data and compare the results to aid in manual post-processing. Furthermore, in the publicly available tool documentation we indicate how our tool can be used by the reader to tune towards overestimating false negatives should they wish (Abrahams & McKenzie 2024).
In a further claim, Li et al. (2025) state that Abrahams et al. (2024) have not ‘sufficiently explored’ the tradeoff between false positives and false negatives. We disagree, as this tradeoff is well captured by the F1 score, which is commonly used to assess performance in the presence of class imbalance (He & Garcia 2009; Flach & Kull 2015, among others), as described above and shared in the Supporting Information Data S2, where we state ‘for this reason, we primarily focus on F1 score throughout this work’. Furthermore, the ROC curves shown in Fig. 4A (Abrahams et al. 2024) explore this argument. Li et al. (2025) also suggest that we should have utilized a precision-recall curve (precision vs. recall) instead of an ROC curve (true positive rate vs. false positive rate) in making our model selection; however, as is highlighted in Flach & Kull (2015) and elsewhere, precision-recall curves can be misleading in cases of class imbalance, whereas it has been widely established that ROC curves are not sensitive to class ratio (Fawcett 2006) and are therefore a more stable approach. Recently, McDermott et al. (2024) demonstrated that the area under the precision-recall curve is an explicitly biased and discriminatory metric for model selection, and recommended relying on the area under an ROC curve in cases where minimizing false negatives is more important than minimizing false positives, which is the stated need of our paper.
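The stability argument can be checked numerically. In the sketch below (synthetic scores, not our data), a classifier of fixed skill keeps an essentially constant ROC AUC as the class ratio worsens, while the area under the precision-recall curve (average precision) drops; the score distributions are an assumption made purely for illustration.

```python
# Synthetic demonstration: ROC AUC is stable across class ratios, average precision is not.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def simulate(n_pos, n_neg):
    # Scores for positives are shifted upwards by a fixed amount: constant "skill".
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return labels, scores

for n_neg in (1_000, 10_000, 100_000):
    labels, scores = simulate(1_000, n_neg)
    print(n_neg, round(roc_auc_score(labels, scores), 3),
          round(average_precision_score(labels, scores), 3))
# ROC AUC stays near 0.76 at every ratio; average precision falls as negatives dominate.
```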
Li et al. (2025) disagree with our choice to bias our model fit towards false positives; however, in any ML model there is a tuning choice between biasing the model towards overprediction (more false positives) or underprediction (more false negatives), two aims that are inherently in conflict in imbalanced data sets (Chawla 2005). Since we know that post-processing is a viable alternative for any user, and indeed, we recommend this to any practitioner who implements our tool, we choose to tune towards overprediction in order not to miss any true positives. The model output needs to be manually assessed for accuracy, but for glacial geomorphologists, starting from a data set containing a fraction of false positives is preferable to revisiting the raw elevation data to identify missed true positives. If a user, like Li et al. (2025) with their preference towards underprediction, prefers another tuning, however, bedfinder (Abrahams & McKenzie 2024) can be implemented as is to select a stricter probability threshold in its prediction and therefore be tuned towards false negatives. Figure 4B in Abrahams et al. (2024) illustrates the tradeoff between precision and recall as this probability threshold is tuned.
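For users who want to make that adjustment themselves, the self-contained sketch below (placeholder data and a generic classifier, not bedfinder) shows how raising the probability threshold trades recall for precision.

```python
# Self-contained sketch of probability-threshold tuning on imbalanced placeholder data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)   # ~10% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.3, 0.5, 0.7):   # stricter thresholds predict fewer positives
    pred = (proba >= threshold).astype(int)
    print(threshold, precision_score(y_te, pred), recall_score(y_te, pred))
```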
We are surprised to see that Li et al. (2025) take issue with our choice to filter the data as a preprocessing step in our pipeline, as resampling is a well-established method for preparing imbalanced data for use with ML (Chawla 2005; He & Garcia 2009, among others). He and Garcia (2009) in ‘Learning from imbalanced data’, recommend undersampling the majority class rather than oversampling to avoid overfitting, and broadly divide this approach into two categories: ‘random undersampling’ and ‘informed undersampling’. Not only did we preliminarily train and test the statistical model on the unfiltered data set (see links to the GitHub repository containing the accompanying analysis files in the Data Availability statement of our paper, Abrahams & McKenzie 2024), we also trained and tested the statistical model on random undersampling as defined by He and Garcia (2009), as is shown in Fig. 3 of Abrahams et al. (2024). We found that the model was only able to recover 22% of the true positives without any filtering (Fig. 3 of Abrahams et al. 2024 shows that random undersampling only recovers 43% of the true positives). These model fits – trained on unfiltered data and on data where the non-glacially derived landforms were randomly undersampled – were biased towards underidentifying true glacially derived bedforms. In both cases, flipping a fair coin for class assignment would provide better recall than the model fits. (Of course, a coin flip would also provide worse precision, since in both of these cases the model fits are biased towards missing true glacially derived bedforms and would therefore assign negatives (both correctly and incorrectly) more often than a fair coin flip would. A coin flip approach would therefore not alleviate the need for manual assignment on the entire input data set any more than models trained on unfiltered or unusefully filtered data, but does illustrate the need for informed filtering in this case). As requested by Li et al. (2025) we reproduce the confusion matrices for these investigations here in Fig. 1, which illustrate how the majority of bedforms are missed while using unfiltered or unusefully filtered data.
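As a pointer for readers who want to reproduce a comparison of this kind on their own data, the sketch below applies imbalanced-learn's RandomUnderSampler to placeholder data and prints the resulting confusion matrix; it illustrates the random-undersampling baseline in general, not our exact experiments or Fig. 1.

```python
# Hedged sketch of the random-undersampling baseline on imbalanced placeholder data.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.97], random_state=0)  # ~3% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Randomly discard majority-class points until the two classes are balanced.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)
model = RandomForestClassifier(random_state=0).fit(X_res, y_res)

# Rows are true classes, columns predicted classes; missed true positives sit in row 1, column 0.
print(confusion_matrix(y_te, model.predict(X_te)))
```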
He & Garcia (2009) offer several examples of informed undersampling algorithms, including Near Miss, all of which implement purely statistical approaches to filtering the data, often requiring advanced knowledge of class assignment. Ultimately, after testing several undersampling approaches (see the analysis files accompanying our paper), we settled on Near Miss as it provided the most generalizable results on a withheld test set. Li et al. (2025) state that our approach is missing these ablation studies; however, we tested a variety of filtering approaches, including data cleaning techniques, before finalizing the choice of Near Miss, and all of these ablation studies are available in our accompanying publicly available GitHub repository (Abrahams & McKenzie 2024).
As mentioned above, Near Miss and other informed undersampling algorithms (e.g. He & Garcia 2009) rely on statistical filtering approaches to create greater distance between classes within the data. However, when a scientific expert manually labels bedforms, often with a prefilter approach to remove obviously spurious detections, the technique informing this filter is scientifically motivated rather than statistically motivated. We believe that Abrahams et al. (2024) speaks for itself regarding the power of combining these two approaches. Li et al. (2025) claim that the choices that led to our filtering approach are not transparent, but we would like to remind the reader that the opposite is true: our filtering approach is clearly outlined and probed in the ‘Filtering the non-glacial features to balance classes’ section of the paper, and we make these filtering options easily available to the reader as the filtering function in bedfinder and in the accompanying GitHub repo (Abrahams & McKenzie 2024). These gains in the transparency of decision making are a marked improvement over the ‘[…] manual classification choices [that vary] from expert to expert [that] can lead to difficult to reproduce, subjective classification schema with associated unquantified error’ (Abrahams et al. 2024). We would also like to emphasize that Near Miss is itself a ‘data-driven filtering technique’. In adopting an approach that combines statistical filtering with scientific filtering, we have already created an implementation that allows the model to adjust to new data without overfitting. The power of this combined approach allows scientific information to guide the model in a way that can still be generalized to new regions.
We agree with the astute observation from Li et al. (2025), raised first in Abrahams et al. (2024), that addressing the need for expanded publicly available training sets will refine the model's ability to perform in OOD regions. To showcase our initial argument in support of open science and reproducibility: in writing this response, we were able to address all of Li et al.'s (2025) concerns without conducting any new analyses, simply by providing references to the already existing, fully available data and results from our initial study (Abrahams & McKenzie 2024). We thank the authors of the Comments for their support in helping us illustrate this point. By incorporating existing glacial geomorphology training data into the findable, accessible, interoperable, and reusable (FAIR; Wilkinson et al. 2016) data framework, the potential of bedfinder and other powerful geospatial modelling tools to be trained on well-rounded and robust data sets will be strengthened.