Alignment with Preference Optimization Is All You Need for LLM Safety
Reda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, Mohamed El Amine Seddik, Mugariya Farooq, Hakim Hacid
arXiv:2409.07772 (arXiv - CS - Machine Learning), 12 September 2024
Abstract
We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in the global safety score (from $57.64\%$ to $99.90\%$) as measured by LlamaGuard 3 8B, making the model competitive with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over $0.6$ to less than $0.07$. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and general performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.
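
For context, the sketch below illustrates one common pairwise form of a noise contrastive alignment (NCA) loss, in which the implicit reward is the beta-scaled log-likelihood ratio between the trainable policy and a frozen reference model. This is a minimal illustration, not the paper's exact Safe-NCA objective: the function name `nca_pairwise_loss`, the tensor inputs, and the hyperparameter `beta` are illustrative assumptions, and the safety-specific data construction described in the paper is not shown.

```python
import torch
import torch.nn.functional as F


def nca_pairwise_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Illustrative pairwise NCA-style preference loss.

    Each input is the summed log-probability of a full response (chosen or
    rejected) under the trainable policy or the frozen reference model,
    with shape (batch,).
    """
    # Implicit rewards: scaled log-likelihood ratio of policy vs. reference.
    r_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    r_rejected = beta * (policy_rejected_logps - ref_rejected_logps)

    # Push the chosen reward up while regularizing both rewards toward zero,
    # so rewards are calibrated absolutely rather than only relative to each
    # other (the latter is what a DPO-style margin loss does).
    loss = (
        -F.logsigmoid(r_chosen)
        - 0.5 * (F.logsigmoid(-r_chosen) + F.logsigmoid(-r_rejected))
    )
    return loss.mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 preference pairs.
    torch.manual_seed(0)
    pol_c, pol_r = -10 * torch.rand(4), -10 * torch.rand(4)
    ref_c, ref_r = -10 * torch.rand(4), -10 * torch.rand(4)
    print(nca_pairwise_loss(pol_c, pol_r, ref_c, ref_r).item())
```

In a safety-alignment setting of the kind the abstract describes, the chosen responses would be safe completions and the rejected responses unsafe ones, with the reference model being the pre-alignment checkpoint (here, Falcon 11B).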