Tanny Chavez, Zhuowen Zhao, Runbo Jiang, Wiebke Koepp, Dylan McReynolds, Petrus H Zwart, Daniel B Allan, Eliot H Gann, Nicholas Schwarz, Daniela Ushizima, Edward S Barnard, Apurva Mehta, Subramanian Sankaranarayanan, Alexander Hexemer
Journal of Applied Crystallography, volume 58, part 3, pages 731-745. Published 2025-05-12 (eCollection 2025-06-01). DOI: 10.1107/S1600576725002328 (https://doi.org/10.1107/S1600576725002328). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12135984/pdf/
Citations: 0
Abstract
A machine-learning-driven data labeling pipeline for scientific analysis in MLExchange.
This study introduces a novel labeling pipeline to accelerate the labeling process of scientific data sets by using artificial intelligence (AI)-guided tagging techniques. This pipeline includes a set of interconnected web-based graphical user interfaces (GUIs), where Data Clinic and MLCoach enable the preparation of machine learning (ML) models for data reduction and classification, respectively, while Label Maker is used for label assignment. Throughout this pipeline, data can be accessed through a direct connection to a file system or through Tiled for access through Hypertext Transfer Protocol (HTTP). Our experimental results present three use cases where this labeling pipeline has been instrumental for the study of large X-ray scattering data sets in the area of pattern recognition, the remote analysis of resonant soft X-ray scattering data and the fine-tuning process of foundation models. These use cases highlight the labeling capabilities of this pipeline, including the ability to label large data sets in a short period of time, to perform remote data analysis while minimizing data movement and to enhance the fine-tuning process of complex ML models with human involvement.
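As a rough illustration of what AI-guided tagging can look like, the sketch below suggests labels for unlabeled items by majority vote among their nearest human-labeled neighbors in a reduced (latent) space. This is not the authors' implementation and does not use the Data Clinic, MLCoach, Label Maker or Tiled APIs: the function names, the 2-D latent vectors and the "ring"/"peak" labels are all hypothetical, and the latent vectors simply stand in for the output of some data-reduction model.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def suggest_labels(latent, labeled, k=3):
    """Suggest a label for every unlabeled item.

    latent  : list of latent vectors, one per data item (hypothetical input)
    labeled : dict mapping item index -> label assigned by a human
    k       : number of nearest labeled neighbors that vote

    Returns a dict mapping item index -> (suggested label, vote fraction).
    """
    suggestions = {}
    for idx, vec in enumerate(latent):
        if idx in labeled:
            continue  # already labeled by a human, nothing to suggest
        # Rank the human-labeled items by distance to this item.
        nearest = sorted(labeled, key=lambda j: euclidean(vec, latent[j]))[:k]
        votes = Counter(labeled[j] for j in nearest)
        label, count = votes.most_common(1)[0]
        suggestions[idx] = (label, count / len(nearest))
    return suggestions


# Hypothetical 2-D latent vectors forming two well-separated groups.
latent = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
# A human has labeled four of the six items.
labeled = {0: "ring", 1: "ring", 3: "peak", 4: "peak"}

suggestions = suggest_labels(latent, labeled, k=2)
for idx, (label, fraction) in sorted(suggestions.items()):
    print(f"item {idx}: suggested label {label!r} (vote fraction {fraction:.2f})")
```

In a human-in-the-loop setting, the vote fraction could be used to decide which suggestions are accepted automatically and which are sent back for manual review; that policy is likewise an assumption here rather than something stated in the abstract.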
About the journal:
Many research topics in condensed matter research, materials science and the life sciences make use of crystallographic methods to study crystalline and non-crystalline matter with neutrons, X-rays and electrons. Articles published in the Journal of Applied Crystallography focus on these methods and their use in identifying structural and diffusion-controlled phase transformations, structure-property relationships, structural changes of defects, interfaces and surfaces, etc. Developments of instrumentation and crystallographic apparatus, theory and interpretation, numerical analysis and other related subjects are also covered. The journal is the primary place where crystallographic computer program information is published.