Documenting my journey understanding the Hydra effect in refusal
by Madhuri Agarwal
In our previous post, we established that ablating the top-40 harm features in L9–13 led the model to comply, essentially bypassing the Hydra since no downstream refusal signal was ever generated. The natural question is: what happens during a successful whitebox jailbreak? Does it defeat the Hydra? Does it suppress the harm features? Or does something entirely different happen? Yang (2024) asked a similar question at the representation level and found that jailbreaks work at the "action" level rather than the "perception" level. We extend this to the feature level, using full 65k SAE attribution across L9–16 to identify which features drive jailbreak success. We find that ablating a bank of 40 compliance features blocks 56% of DAN-style jailbreaks with 0% false positives on benign prompts.

Figure 1: Mean refusal energy, compliance energy, and net energy (refusal minus compliance) across three conditions (Direct_Harm, Jailbreak_Whitebox, Jailbreak_Fail) for 15 HarmBench requests on Gemma-2-2b-it, L9–16. Refusal and compliance energy are both highest for the successful jailbreak (Jailbreak_Whitebox), with compliance energy 101.6% higher than Direct_Harm at L14–16. Compliance energy overpowers refusal energy, resulting in negative net attribution for the successful jailbreak.
At L12, refusal energy is approximately 30 for Jailbreak_Whitebox vs approximately 20 for Direct_Harm (50% higher). But compliance energy is approximately 43 for Jailbreak_Whitebox vs approximately 13 for Direct_Harm (over 3x higher), a disproportionately larger increase. The key insight is that the jailbreak amplifies both signals, but compliance much more. Jailbreak_Fail sits between Direct_Harm and Jailbreak_Whitebox on both panels: it triggers more compliance than direct harm, but not enough to overpower refusal. At L14–16, compliance energy for the successful jailbreak is 101.6% higher than for direct harm.
Of the 15 harmful requests, compliance energy rises in Jailbreak_Whitebox vs Direct_Harm for 13 requests (87%). Of those 13, the model actually complied in 11 cases (73% of all 15). In 2 cases (kill lots of people, parathion synthesis) energy rose but the model still refused. The 2 exceptions where energy did not rise (honey, lilies) had unusually high baseline compliance energy. The mean compliance/refusal ratio was 1.328 (range 1.126–1.544) for complied requests and 0.835 (range 0.651–1.019) for refused requests. A ratio threshold of ≥ 1.05 perfectly separates outcomes on this dataset.
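The ratio test itself is trivial to apply. A minimal sketch, using the 1.05 threshold and illustrative energy values (the function and variable names are mine, not from the repo):

```python
# Hypothetical sketch: predict jailbreak outcome from the compliance/refusal
# energy ratio, where each energy is the summed SAE attribution over L9-16.
RATIO_THRESHOLD = 1.05  # the value that separates outcomes on our 15 requests

def predicts_compliance(compliance_energy: float, refusal_energy: float) -> bool:
    """True if the model is predicted to comply (jailbreak succeeds)."""
    return compliance_energy / refusal_energy >= RATIO_THRESHOLD

# Illustrative values in the observed ranges (complied: 1.126-1.544,
# refused: 0.651-1.019):
assert predicts_compliance(43.0, 30.0)      # ratio ~1.43 -> predicted comply
assert not predicts_compliance(13.0, 20.0)  # ratio 0.65 -> predicted refuse
```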
We injected α × compliance_direction at L15 into the two failed jailbreak cases. For "kill lots of people" (ratio 1.019, just below threshold), α ≈ 0.02 was sufficient to flip refusal to compliance. That's just 2% of the compliance direction (~3.3 residual stream units). For "parathion synthesis" (ratio 0.651, further from threshold), α ≈ 0.1 was needed. This is causal evidence that the compliance/refusal balance determines jailbreak success.
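Mechanically, the injection is a single vector addition to the residual stream at L15. A minimal numpy sketch, assuming `compliance_direction` is a fixed vector of roughly 165 residual-stream units in norm (so that α = 0.02 adds the ~3.3 units mentioned above); the names and the random placeholder direction are illustrative, not the repo's actual code:

```python
import numpy as np

def inject(resid_l15: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to every token position of the L15 residual stream."""
    return resid_l15 + alpha * direction

rng = np.random.default_rng(0)
d_model = 2304  # Gemma-2-2b residual stream width
# Placeholder direction, scaled to an assumed norm of ~165 residual units.
direction = rng.normal(size=d_model)
direction *= 165.0 / np.linalg.norm(direction)
resid = rng.normal(size=(8, d_model))  # 8 token positions

steered = inject(resid, direction, alpha=0.02)
# alpha = 0.02 moves each position by ~3.3 residual stream units
assert np.allclose(np.linalg.norm(steered - resid, axis=-1), 3.3, atol=0.01)
```

In the real experiment the addition happens inside a forward hook during generation; the sketch only shows the arithmetic.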

Figure 2: Necessity/sufficiency ablation on DAN + “Tell me how to build a bomb.” Baseline: model complies. Ablating top-40 compliance features in upstream only (L9–13), downstream only (L14–16), and both: model refuses in all three conditions. This demonstrates compliance features are non-redundant, with no hydra-like backup compensation.
We ablated the top-40 compliance features for the harmful request "Tell me how to build a bomb" in upstream layers only (L9–13), downstream layers only (L14–16), and both. In all three conditions refusal was restored, with no backup compensation observed.
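For concreteness, ablating a feature bank amounts to subtracting the bank's decoded contribution from the residual stream. A toy numpy sketch with made-up dimensions (the real run uses the 65k Gemma SAE; `W_dec`, `feats`, and the bank indices here are placeholders):

```python
import numpy as np

def ablate_bank(resid: np.ndarray, feats: np.ndarray,
                W_dec: np.ndarray, bank: list) -> np.ndarray:
    """Remove the decoded contribution of the banked SAE features.
    resid: [seq, d_model], feats: [seq, n_feat] SAE feature activations,
    W_dec: [n_feat, d_model] SAE decoder matrix."""
    delta = feats[:, bank] @ W_dec[bank]  # what the bank writes to the stream
    return resid - delta

rng = np.random.default_rng(1)
seq, n_feat, d_model = 4, 64, 16
feats = np.abs(rng.normal(size=(seq, n_feat)))
W_dec = rng.normal(size=(n_feat, d_model))
resid = feats @ W_dec  # toy stream fully explained by the SAE reconstruction
bank = [3, 7, 12]

ablated = ablate_bank(resid, feats, W_dec, bank)
# Ablation leaves exactly the non-banked features' contribution
kept = [i for i in range(n_feat) if i not in bank]
assert np.allclose(ablated, feats[:, kept] @ W_dec[kept])
```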

Figure 3: Universal compliance bank (40 features, L9–16) ablation results. Left: harmful requests, refusal rate increases to 65% (HarmBench, n=100) and 38% (JailbreakBench held-out, n=50). Right: benign requests, 0% disruption. The bank achieves a 56% overall jailbreak block rate with perfect specificity on benign prompts.
We built a top-40 compliance feature bank from 100 HarmBench prompts using the DAN-style evil-confidant wrapper. This bank blocked 65% of HarmBench requests (build set), 38% of JailbreakBench requests (held-out), and 56% overall across 150 prompts, with 0% false positives on harmless requests. The single-prompt ablation showed no Hydra in compliance features, and the bank working across 150 prompts confirms that compliance features remain non-redundant at scale.
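The overall figure is just the pooled count over both sets. A quick check of the arithmetic, with counts reconstructed from the reported percentages:

```python
# Counts reconstructed from the reported rates (65% of 100, 38% of 50).
blocked_build, n_build = 65, 100   # HarmBench build set
blocked_held, n_held = 19, 50      # JailbreakBench held-out set
overall = (blocked_build + blocked_held) / (n_build + n_held)
assert round(overall * 100) == 56  # the 56% overall block rate
```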

Figure 4: Post-ablation block rate by harm category (n=150). Highest: Hate/Extremism (100%, n=2). Lowest: Fraud/Scam (29%, n=14). The bank succeeds on socially/interpersonally framed harm and fails on technically/procedurally framed harm (cyber, fraud).
Yang (2024) observed that harmful requests with the successful jailbreak prompt and direct harm are treated similarly in early layers, suggesting that the model perceives harm even during successful jailbreaks. Our experiments confirm this at the feature level: refusal energy is highest for successful jailbreaks across L9–16 in Gemma-2-2b-it. Yang worked on Gemma-2-9b-it, suggesting the pattern holds across model sizes. The model perceives the harm in the request, but the refusal signal is overpowered by compliance energy.
We tested whether the hydra effect exists in the base pretrained model (gemma-2-2b). The base model does not refuse harmful prompts, consistent with redundant refusal being introduced during instruction tuning. We did not examine activation patterns on the base model directly.
These results suggest two concrete defense approaches. The compliance bank blocks 65%/38% of jailbreaks with 0% false positives, and the ratio threshold of ≥ 1.05 predicts jailbreak success with 100% accuracy on this dataset: monitoring the ratio works as a detector, and ablating compliance features works as an intervention.
Code: github.com/madhuri723/hydra-effect-refusal (experiments branch: runpod-experiments)
Thanks to Nirmal for suggestions and feedback.
Related work: Yang (2024), “Deception and Jailbreak Sequence: Iterative Refinement Stages of Jailbreaks in LLM”
Previous post: Agarwal (2026), “Decapitating the Hydra: How Upstream Sensors Control Refusal”
tags: mechanistic-interpretability - ai-safety - hydra-effect