Alignment with Preference Optimization Is All You Need for LLM Safety
Reda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, Mohamed El Amine Seddik, Mugariya Farooq, Hakim Hacid
arXiv:2409.07772 (arXiv - CS - Machine Learning), 12 September 2024
Abstract
We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in the global safety score (from $57.64\%$ to $99.90\%$) as measured by LlamaGuard 3 8B, making the model competitive with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over $0.6$ to less than $0.07$. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and general performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.
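
For context, the sketch below illustrates one common pairwise form of a noise contrastive alignment (NCA) loss, in which the implicit reward is the beta-scaled log-likelihood ratio between the trainable policy and a frozen reference model. This is a minimal illustration, not the paper's exact Safe-NCA objective: the function name `nca_pairwise_loss`, the tensor inputs, and the hyperparameter `beta` are illustrative assumptions, and the safety-specific data construction described in the paper is not shown.

```python
import torch
import torch.nn.functional as F


def nca_pairwise_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Illustrative pairwise NCA-style preference loss.

    Each input is the summed log-probability of a full response (chosen or
    rejected) under the trainable policy or the frozen reference model,
    with shape (batch,).
    """
    # Implicit rewards: scaled log-likelihood ratio of policy vs. reference.
    r_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    r_rejected = beta * (policy_rejected_logps - ref_rejected_logps)

    # Push the chosen reward up while regularizing both rewards toward zero,
    # so rewards are calibrated absolutely rather than only relative to each
    # other (the latter is what a DPO-style margin loss does).
    loss = (
        -F.logsigmoid(r_chosen)
        - 0.5 * (F.logsigmoid(-r_chosen) + F.logsigmoid(-r_rejected))
    )
    return loss.mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 preference pairs.
    torch.manual_seed(0)
    pol_c, pol_r = -10 * torch.rand(4), -10 * torch.rand(4)
    ref_c, ref_r = -10 * torch.rand(4), -10 * torch.rand(4)
    print(nca_pairwise_loss(pol_c, pol_r, ref_c, ref_r).item())
```

In a safety-alignment setting of the kind the abstract describes, the chosen responses would be safe completions and the rejected responses unsafe ones, with the reference model being the pre-alignment checkpoint (here, Falcon 11B).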