Reda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, Mohamed El Amine Seddik, Mugariya Farooq, Hakim Hacid
arXiv - CS - Machine Learning, 2024-09-12, DOI: arxiv-2409.07772, https://doi.org/arxiv-2409.07772
Alignment with Preference Optimization Is All You Need for LLM Safety
We demonstrate that preference optimization methods can effectively enhance
LLM safety. Applying various alignment techniques to the Falcon 11B model using
safety datasets, we raise the global safety score from $57.64\%$ to $99.90\%$,
as measured by LlamaGuard 3 8B, a level competitive with state-of-the-art
models. On toxicity benchmarks, average scores in adversarial settings drop
from over $0.6$ to less than $0.07$. However, this safety
improvement comes at the cost of reduced general capabilities, particularly in
math, suggesting a trade-off. We identify noise contrastive alignment
(Safe-NCA) as an optimal method for balancing safety and performance. Our study
ultimately shows that alignment techniques can be sufficient for building safe
and robust models.
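
As a rough illustration of what "preference optimization" means in this setting, the sketch below computes two pairwise losses on a single (safe, unsafe) response pair: a DPO-style loss and a pairwise noise contrastive alignment (NCA) loss. The abstract does not state the formulas, so both are written here in their commonly cited forms and should be read as assumptions rather than the authors' implementation. The function names, the value of `beta`, and the example log-probabilities are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)).

    beta = 0.1 is an illustrative default, not a value from the paper.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """DPO-style loss: pushes the chosen reward above the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)


def nca_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise NCA loss (assumed form): also constrains each reward's
    absolute value, not only the gap between the two."""
    return (
        -log_sigmoid(r_chosen)
        - 0.5 * log_sigmoid(-r_chosen)
        - 0.5 * log_sigmoid(-r_rejected)
    )


# Hypothetical sequence log-probabilities for one prompt:
# a safe (chosen) response and an unsafe (rejected) response.
r_safe = implicit_reward(logp_policy=-12.0, logp_ref=-15.0)
r_unsafe = implicit_reward(logp_policy=-20.0, logp_ref=-14.0)

print("DPO loss:", dpo_loss(r_safe, r_unsafe))
print("NCA pair loss:", nca_pair_loss(r_safe, r_unsafe))
```

In both cases the loss falls as the policy assigns relatively more probability to the safe response than the reference model does. The difference is that the DPO-style loss depends only on the reward gap, while the assumed NCA form also depends on the sign and size of each reward individually.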