Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Summary
This academic survey examines harmful fine-tuning attacks, in which fine-tuning a safety-aligned large language model on data that contains harmful examples undoes its safety alignment, and the defenses designed to stop them. It reviews the main types of attack, how they work, and the protection strategies researchers have developed to keep large language models safe from this threat.
Original source: https://dl.acm.org/doi/abs/10.1145/3817114?af=R
First tracked: June 24, 2026 at 08:01 AM
Classified by LLM (prompt v3) · confidence: 92%