Multi-sample \(\zeta\)-mixup: richer, more realistic synthetic samples from a p-series interpolant
Journal of Big Data (Journal Article), published 2024-03-23
DOI: https://doi.org/10.1186/s40537-024-00898-6
Citation count: 0
Abstract
Modern deep learning training procedures rely on model regularization techniques such as data augmentation methods, which generate training samples that increase the diversity of data and richness of label information. A popular recent method, mixup, uses convex combinations of pairs of original samples to generate new samples. However, as we show in our experiments, mixup can produce undesirable synthetic samples, where the data is sampled off the manifold and can contain incorrect labels. We propose \(\zeta \)-mixup, a generalization of mixup with provably and demonstrably desirable properties that allows convex combinations of \({T} \ge 2\) samples, leading to more realistic and diverse outputs that incorporate information from \({T}\) original samples by using a p-series interpolant. We show that, compared to mixup, \(\zeta \)-mixup better preserves the intrinsic dimensionality of the original datasets, which is a desirable property for training generalizable models. Furthermore, we show that our implementation of \(\zeta \)-mixup is faster than mixup, and extensive evaluation on controlled synthetic and 26 diverse real-world natural and medical image classification datasets shows that \(\zeta \)-mixup outperforms mixup, CutMix, and traditional data augmentation techniques. The code will be released at https://github.com/kakumarabhishek/zeta-mixup.
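The abstract does not spell out the weighting scheme, so the following is only a minimal sketch of what a convex combination of \(T\) samples with p-series weights could look like. It assumes the weights are proportional to \(i^{-\gamma}\) for \(i = 1, \dots, T\), normalized to sum to 1, and assigned to the samples in a random order for each synthetic output. The function names `zeta_weights` and `zeta_mixup`, the parameter `gamma`, and its default value of 2.8 are illustrative assumptions, not the authors' released implementation.

```python
import random


def zeta_weights(T, gamma):
    """Normalized p-series weights: w_i proportional to i**(-gamma), i = 1..T.

    Assumption: this form is inferred from the phrase "p-series interpolant";
    the abstract itself does not state the formula.
    """
    raw = [i ** (-gamma) for i in range(1, T + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def zeta_mixup(samples, labels, gamma=2.8, rng=None):
    """Build T synthetic (sample, label) pairs from T originals.

    samples: list of T feature vectors (lists of floats, equal length)
    labels:  list of T one-hot label vectors (lists of floats, equal length)
    gamma:   hypothetical decay exponent; larger values keep each output
             closer to a single original sample
    """
    rng = rng or random.Random()
    T = len(samples)
    n_features = len(samples[0])
    n_classes = len(labels[0])
    w = zeta_weights(T, gamma)

    mixed_x, mixed_y = [], []
    for _ in range(T):
        # Randomly decide which original sample receives which weight.
        order = list(range(T))
        rng.shuffle(order)
        # Convex combination of features and of labels with the same weights.
        x = [sum(w[k] * samples[order[k]][d] for k in range(T))
             for d in range(n_features)]
        y = [sum(w[k] * labels[order[k]][c] for k in range(T))
             for c in range(n_classes)]
        mixed_x.append(x)
        mixed_y.append(y)
    return mixed_x, mixed_y


if __name__ == "__main__":
    xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    new_xs, new_ys = zeta_mixup(xs, ys, rng=random.Random(0))
    for x, y in zip(new_xs, new_ys):
        print(x, y)
```

Because the weights are non-negative and sum to 1, every synthetic sample lies inside the convex hull of the originals and every mixed label remains a valid probability vector. Under this assumed weighting, a sufficiently large `gamma` gives the first weight more than half of the total mass, so each output stays close to one original sample and inherits its label as the dominant class. With \(T = 2\) the scheme reduces to a pairwise convex combination like mixup, but with a fixed rather than randomly drawn mixing coefficient.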
About the journal:
The Journal of Big Data publishes high-quality, scholarly research papers, methodologies, and case studies covering a broad spectrum of topics, from big data analytics to data-intensive computing and all applications of big data research. It addresses challenges facing big data today and in the future, including data capture and storage, search, sharing, analytics, technologies, visualization, architectures, data mining, machine learning, cloud computing, distributed systems, and scalable storage. The journal serves as a seminal source of innovative material for academic researchers and practitioners alike.