Large-Scale Pretraining and Finetuning for Efficient Jet Classification in Particle Physics
Zihan Zhao, Farouk Mokhtar, Raghav Kansal, Haoyang Li, Javier Duarte
{"title":"大规模预训练和微调,实现粒子物理中的高效射流分类","authors":"Zihan Zhao, Farouk Mokhtar, Raghav Kansal, Haoyang Li, Javier Duarte","doi":"arxiv-2408.09343","DOIUrl":null,"url":null,"abstract":"This study introduces an innovative approach to analyzing unlabeled data in\nhigh-energy physics (HEP) through the application of self-supervised learning\n(SSL). Faced with the increasing computational cost of producing high-quality\nlabeled simulation samples at the CERN LHC, we propose leveraging large volumes\nof unlabeled data to overcome the limitations of supervised learning methods,\nwhich heavily rely on detailed labeled simulations. By pretraining models on\nthese vast, mostly untapped datasets, we aim to learn generic representations\nthat can be finetuned with smaller quantities of labeled data. Our methodology\nemploys contrastive learning with augmentations on jet datasets to teach the\nmodel to recognize common representations of jets, addressing the unique\nchallenges of LHC physics. Building on the groundwork laid by previous studies,\nour work demonstrates the critical ability of SSL to utilize large-scale\nunlabeled data effectively. We showcase the scalability and effectiveness of\nour models by gradually increasing the size of the pretraining dataset and\nassessing the resultant performance enhancements. Our results, obtained from\nexperiments on two datasets -- JetClass, representing unlabeled data, and Top\nTagging, serving as labeled simulation data -- show significant improvements in\ndata efficiency, computational efficiency, and overall performance. These\nfindings suggest that SSL can greatly enhance the adaptability of ML models to\nthe HEP domain. This work opens new avenues for the use of unlabeled data in\nHEP and contributes to a better understanding the potential of SSL for\nscientific discovery.","PeriodicalId":501065,"journal":{"name":"arXiv - PHYS - Data Analysis, Statistics and Probability","volume":"47 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large-Scale Pretraining and Finetuning for Efficient Jet Classification in Particle Physics\",\"authors\":\"Zihan Zhao, Farouk Mokhtar, Raghav Kansal, Haoyang Li, Javier Duarte\",\"doi\":\"arxiv-2408.09343\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study introduces an innovative approach to analyzing unlabeled data in\\nhigh-energy physics (HEP) through the application of self-supervised learning\\n(SSL). Faced with the increasing computational cost of producing high-quality\\nlabeled simulation samples at the CERN LHC, we propose leveraging large volumes\\nof unlabeled data to overcome the limitations of supervised learning methods,\\nwhich heavily rely on detailed labeled simulations. By pretraining models on\\nthese vast, mostly untapped datasets, we aim to learn generic representations\\nthat can be finetuned with smaller quantities of labeled data. Our methodology\\nemploys contrastive learning with augmentations on jet datasets to teach the\\nmodel to recognize common representations of jets, addressing the unique\\nchallenges of LHC physics. Building on the groundwork laid by previous studies,\\nour work demonstrates the critical ability of SSL to utilize large-scale\\nunlabeled data effectively. 
We showcase the scalability and effectiveness of\\nour models by gradually increasing the size of the pretraining dataset and\\nassessing the resultant performance enhancements. Our results, obtained from\\nexperiments on two datasets -- JetClass, representing unlabeled data, and Top\\nTagging, serving as labeled simulation data -- show significant improvements in\\ndata efficiency, computational efficiency, and overall performance. These\\nfindings suggest that SSL can greatly enhance the adaptability of ML models to\\nthe HEP domain. This work opens new avenues for the use of unlabeled data in\\nHEP and contributes to a better understanding the potential of SSL for\\nscientific discovery.\",\"PeriodicalId\":501065,\"journal\":{\"name\":\"arXiv - PHYS - Data Analysis, Statistics and Probability\",\"volume\":\"47 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - PHYS - Data Analysis, Statistics and Probability\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.09343\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Data Analysis, Statistics and Probability","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.09343","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
This study introduces an innovative approach to analyzing unlabeled data in high-energy physics (HEP) through the application of self-supervised learning (SSL).
Faced with the increasing computational cost of producing high-quality labeled simulation samples at the CERN LHC, we propose leveraging large volumes of unlabeled data to overcome the limitations of supervised learning methods, which heavily rely on detailed labeled simulations.
By pretraining models on these vast, mostly untapped datasets, we aim to learn generic representations that can be finetuned with smaller quantities of labeled data.
Our methodology employs contrastive learning with augmentations on jet datasets to teach the model to recognize common representations of jets, addressing the unique challenges of LHC physics.
Building on the groundwork laid by previous studies, our work demonstrates the critical ability of SSL to utilize large-scale unlabeled data effectively.
We showcase the scalability and effectiveness of our models by gradually increasing the size of the pretraining dataset and assessing the resultant performance enhancements.
Our results, obtained from experiments on two datasets, JetClass (representing unlabeled data) and Top Tagging (serving as labeled simulation data), show significant improvements in data efficiency, computational efficiency, and overall performance.
These findings suggest that SSL can greatly enhance the adaptability of ML models to the HEP domain.
This work opens new avenues for the use of unlabeled data in HEP and contributes to a better understanding of the potential of SSL for scientific discovery.
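
The methodology described above, contrastive pretraining with augmentations on unlabeled jets, can be illustrated with a minimal sketch. The code below is an assumption-laden toy example in PyTorch, not the paper's model or configuration: the encoder architecture, the augmentations (a random azimuthal rotation plus Gaussian smearing), and all names such as JetEncoder, augment_jet, and nt_xent are hypothetical stand-ins. It builds two augmented views of each jet and trains the encoder with an NT-Xent contrastive loss, the standard SimCLR-style objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JetEncoder(nn.Module):
    """Toy encoder mapping per-jet constituent features to an embedding (hypothetical)."""
    def __init__(self, n_const=64, n_feat=4, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # (B, n_const, n_feat) -> (B, n_const*n_feat)
            nn.Linear(n_const * n_feat, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x):
        return self.net(x)

def augment_jet(x):
    """Illustrative augmentations: random azimuthal rotation plus mild Gaussian smearing."""
    x = x.clone()
    phi = torch.rand(x.size(0), 1) * 2 * torch.pi          # one random rotation angle per jet
    x[..., 2] = (x[..., 2] + phi) % (2 * torch.pi)          # assume feature index 2 is phi
    return x + 0.01 * torch.randn_like(x)

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent loss: two views of the same jet are positives, all other jets are negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)             # (2B, D) unit-norm embeddings
    sim = z @ z.t() / tau                                    # temperature-scaled cosine similarity
    sim = sim.masked_fill(torch.eye(sim.size(0), dtype=torch.bool), float("-inf"))  # drop self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)

encoder = JetEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
jets = torch.randn(256, 64, 4)                               # stand-in for unlabeled JetClass-like jets
for batch in jets.split(64):
    z1, z2 = encoder(augment_jet(batch)), encoder(augment_jet(batch))
    loss = nt_xent(z1, z2)
    opt.zero_grad()
    loss.backward()
    opt.step()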
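
The pretrain-then-finetune step can be sketched in the same hedged way: reuse the (toy) pretrained encoder, attach a small classification head, and train on a limited labeled sample standing in for the Top Tagging data. FinetuneClassifier, the layer sizes, and the choice to freeze the backbone are illustrative assumptions, not the paper's setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FinetuneClassifier(nn.Module):
    """Pretrained encoder plus a lightweight task head (hypothetical names and sizes)."""
    def __init__(self, encoder, dim=128, n_classes=2):
        super().__init__()
        self.encoder = encoder                    # reuse the encoder from the pretraining sketch
        self.head = nn.Linear(dim, n_classes)     # small classification head trained from scratch

    def forward(self, x):
        return self.head(self.encoder(x))

model = FinetuneClassifier(encoder)               # `encoder` pretrained in the sketch above
for p in model.encoder.parameters():              # freeze the backbone; only the head is updated
    p.requires_grad = False

opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
x_small = torch.randn(512, 64, 4)                 # small labeled sample (e.g. top vs. QCD jets)
y_small = torch.randint(0, 2, (512,))             # binary labels
for xb, yb in zip(x_small.split(64), y_small.split(64)):
    loss = F.cross_entropy(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()

A common variant is to unfreeze the encoder once the head has stabilized and continue training at a lower learning rate; which choice works better typically depends on how much labeled data is available.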