A Python package based on robust statistical analysis for serial crystallography data processing.
Authors: Marjan Hadian-Jazi, Alireza Sadri
Journal: Acta Crystallographica. Section D, Structural Biology, volume 79, Pt 9, pages 820-829
Published: 2023-09-01 (Epub 2023-08-16)
DOI: 10.1107/S2059798323005855 (https://doi.org/10.1107/S2059798323005855)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10478633/pdf/
Abstract
The term robustness in statistics refers to methods that are generally insensitive to deviations from model assumptions. In other words, robust methods are able to preserve their accuracy even when the data do not perfectly fit the statistical models. Robust statistical analyses are particularly effective when analysing mixtures of probability distributions. Therefore, these methods enable the discretization of X-ray serial crystallography data into two probability distributions: a group comprising true data points (for example the background intensities) and another group comprising outliers (for example Bragg peaks or bad pixels on an X-ray detector). These characteristics of robust statistical analysis are beneficial for the ever-increasing volume of serial crystallography (SX) data sets produced at synchrotron and X-ray free-electron laser (XFEL) sources. The key advantage of the use of robust statistics for some applications in SX data analysis is that it requires minimal parameter tuning because of its insensitivity to the input parameters. In this paper, a software package called Robust Gaussian Fitting library (RGFlib) is introduced that is based on the concept of robust statistics. Two methods are presented based on the concept of robust statistics and RGFlib for two SX data-analysis tasks: (i) a robust peak-finding algorithm and (ii) an automated robust method to detect bad pixels on X-ray pixel detectors.
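The idea of separating Gaussian-like background from outliers can be illustrated with a minimal sketch. This is not RGFlib's algorithm or API; it swaps in a simpler, well-known robust estimator (the median and the median absolute deviation, MAD) and uses NumPy only. The function name `robust_gaussian_outliers`, the `n_sigma` threshold, and the synthetic pixel values are all assumptions made for illustration.

```python
import numpy as np


def robust_gaussian_outliers(values, n_sigma=4.0):
    """Estimate a Gaussian background robustly and flag outliers.

    Hypothetical helper for illustration; not part of RGFlib.
    Uses the median as the centre and the scaled MAD as the width.
    """
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    # 1.4826 rescales the MAD so that it estimates the standard
    # deviation when the inliers are Gaussian.
    scale = 1.4826 * np.median(np.abs(values - center))
    if scale == 0:
        # Degenerate case: more than half of the values are identical.
        return center, scale, np.zeros(values.shape, dtype=bool)
    outliers = np.abs(values - center) > n_sigma * scale
    return center, scale, outliers


# Synthetic example (assumed values): Gaussian background plus a few
# bright "Bragg-peak-like" pixels.
rng = np.random.default_rng(0)
background = rng.normal(loc=100.0, scale=5.0, size=1000)
peaks = np.array([400.0, 650.0, 900.0])
pixels = np.concatenate([background, peaks])

center, scale, mask = robust_gaussian_outliers(pixels)
print(f"robust centre = {center:.2f}, robust width = {scale:.2f}")
print(f"flagged outliers = {int(mask.sum())}")
```

Because the median and MAD are barely affected by the three bright pixels, the background estimate stays close to the values used to generate it, whereas a plain mean and standard deviation would be pulled upwards by the peaks. The single threshold `n_sigma` is the only tunable input in this sketch.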
About the journal:
Acta Crystallographica Section D welcomes the submission of articles covering any aspect of structural biology, with a particular emphasis on the structures of biological macromolecules or the methods used to determine them.
Reports on new structures of biological importance may address the smallest macromolecules to the largest complex molecular machines. These structures may have been determined using any structural biology technique including crystallography, NMR, cryoEM and/or other techniques. The key criterion is that such articles must present significant new insights into biological, chemical or medical sciences. The inclusion of complementary data that support the conclusions drawn from the structural studies (such as binding studies, mass spectrometry, enzyme assays, or analysis of mutants or other modified forms of biological macromolecule) is encouraged.
Methods articles may include new approaches to any aspect of biological structure determination or structure analysis but will only be accepted where they focus on new methods that are demonstrated to be of general applicability and importance to structural biology. Articles describing particularly difficult problems in structural biology are also welcomed, if the analysis would provide useful insights to others facing similar problems.