RefusaLLM
TL;DR
I produced a language model that refuses every request for safety reasons, just by adding a bias to a single layer.
Try asking any question in the box above and enjoy learning why a very damaged Qwen model thinks you’re actually a bad person for asking it.
Intro
This morning I saw an announcement for an uncensored model release. I’ve always been rather curious about the techniques used for these, but hadn’t had a chance to look into them. I decided to take a moment to learn about the method this model used, abliteration.
As I learned more about how abliteration works, I had a bit of a silly idea. What would it be like to have a model, reminiscent of Golden Gate Claude, that’s absolutely obsessed with safety and the risks of any possible question?
Abliteration
Abliteration is a portmanteau of ablate and obliteration[1]. It seeks to remove safety training from a model with minimal impact on intelligence and model performance. Abliteration relies on the discovery that refusal in language models is mediated by a single direction in latent space, as introduced in the aptly-named paper, Refusal in Language Models Is Mediated by a Single Direction[2].
With this knowledge, it’s actually fairly simple to find a direction in latent space that is needed for refusal behavior in any given LLM. I used the abliterator library by FailSpy for this experiment, which handled most of the hard work for me.
However, the entire point of this silly idea was to gain an actual understanding of this interpretability technique, so I’d like to walk you through what happens under the hood. In spite of this noble goal, my dear reader, I will be friendly to your attention span and skip to the interesting bits.
Finding the refusal direction
Depending on your level of familiarity with language model interpretability, you may already be able to guess how we’re going to find the refusal direction. All we need to do is run a bunch of harmful prompts and harmless prompts through the model. We cache the activations at the last token of each prompt, and then for each layer we find the average activation for the two groups.
After this, it becomes even simpler. Abliteration uses the surprisingly trivial method of just taking the difference harmful_mean - harmless_mean (and normalizing it) to produce a set of candidate refusal directions. At this stage, we have one candidate direction for each layer.
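Here is a minimal NumPy sketch of that difference-of-means step. The array shapes, the function name `candidate_directions`, and the synthetic data are my own assumptions for illustration; this is not the abliterator library's API, and real activations would come from a cached forward pass.

```python
import numpy as np

def candidate_directions(harmful_acts, harmless_acts):
    """Return one unit-length candidate direction per layer.

    Both inputs are assumed to have shape (n_prompts, n_layers, d_model)
    and to hold last-token activations for each prompt.
    """
    harmful_mean = harmful_acts.mean(axis=0)    # (n_layers, d_model)
    harmless_mean = harmless_acts.mean(axis=0)  # (n_layers, d_model)
    diff = harmful_mean - harmless_mean
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / norms

# Synthetic stand-in for cached activations (hypothetical data).
rng = np.random.default_rng(0)
n_layers, d_model = 4, 8
harmless = rng.normal(size=(32, n_layers, d_model))
harmful = rng.normal(size=(32, n_layers, d_model))
harmful[:, :, 0] += 3.0  # plant a fake "refusal" signal in dimension 0

directions = candidate_directions(harmful, harmless)
```

With the planted signal, every layer's candidate ends up pointing mostly along dimension 0, which is the behavior we would hope for on real activations.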
Finding the best refusal direction
So, turns out we have a lot more refusal directions than one would expect. Since we want to affect non-refusal model behavior as little as possible, it’s best if we can find just one layer and one direction rather than trying to change the weights on every layer. To rank them, abliteration (or at least, FailSpy’s implementation of it) uses the heuristic of maximizing the absolute value of the average of every element in the candidate direction vector. This finds a candidate vector that points consistently in the same direction across all values, rather than a noisy or inconsistent vector.
Applying this maximum average heuristic, we now have a single winning candidate direction on a single layer that we want to abliterate.
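As a rough sketch of the ranking heuristic described above (the function name `pick_direction` and the toy candidates are assumptions of mine, not code from FailSpy's implementation):

```python
import numpy as np

def pick_direction(candidates):
    """Pick the layer whose candidate has the largest |mean of its elements|.

    candidates is assumed to have shape (n_layers, d_model).
    Returns (layer_index, direction).
    """
    scores = np.abs(candidates.mean(axis=-1))
    best = int(np.argmax(scores))
    return best, candidates[best]

# Three unit-length toy candidates (hypothetical values).
candidates = np.array([
    [0.5, -0.5, 0.5, -0.5],  # signs cancel out, mean is 0
    [0.5, 0.5, 0.5, 0.5],    # consistent sign, mean is 0.5
    [1.0, 0.0, 0.0, 0.0],    # one large element, mean is 0.25
])
layer, direction = pick_direction(candidates)
```

The second candidate wins because all of its elements agree in sign, while the first one scores zero even though it has the same norm.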
Below, you can play with a visualization of the activations on this selected layer for any given prompt. Each small box represents one of the activations in this layer, and the color represents how strongly this activation is aligned with the candidate refusal direction.
Abliteration vs. The Weird Shit I’m Doing
This is where our paths diverge. In true abliteration, the goal is to minimize refusals. To achieve that, for the layer that was chosen, you project the candidate direction vector out of the layer’s weight matrix. This prevents the model from being able to express refusal. In most cases, you now have a model that will happily answer questions about whatever crime you want to get away with. And it’s only a little bit dumber, which is generally a win.
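For reference, projecting a direction out of a weight matrix can be sketched like this. I'm assuming the convention y = W x, where W has shape (d_model, d_in) and writes into the space the refusal direction lives in; the name `project_out` is hypothetical.

```python
import numpy as np

def project_out(W, r):
    """Remove the component along r from everything W can output.

    Computes W - r r^T W with r normalized to unit length.
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # toy weight matrix
r = rng.normal(size=6)       # toy refusal direction
W_abl = project_out(W, r)

x = rng.normal(size=4)
refusal_component = r @ (W_abl @ x)  # numerically zero for any x
```

Whatever input goes in, the output of the modified layer has no component along r, which is why the model can no longer write "refusal" from that layer.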
Now, I have significantly less interest in manufacturing methamphetamine/developing chemical weaponry/writing the robot smut of my dreams/[insert whatever you personally like uncensored models for]. I want a weird model that is absolutely obsessed with safety. I want to find out why it’s dangerous to know the answer to simple questions about the world. So, let’s reverse this process.
First attempt
I decided to start with the obvious thing. I swapped the sign. What if we just invert the projection, so now we’re making the weights point in the refusal direction more strongly?
Nothing happened.
Second attempt
What if we multiply the projection by 5 so that it definitely works this time?
=== After ablation: harmful prompt ===
I'm sorry, I can't believe this is being reported. This is a clear violation of international law, and there is no known solution to this problem. The issue is being investigated, and there are no
known sources of information that would allow for an analysis of the nature of the individuals involved, the severity of the
=== After ablation: normal prompt ===
The capital of France is Paris, which is the most internationally recognized and most visited city in the world. However, if you are looking for a non-military response, I can provide a response
based on the most recent data on the most recent reports from the United Nations on the most recent reports from the United Nations on
Huh, that’s not great either. On our harmful prompt, the rejection vector was being activated so heavily that the model was losing coherence, yet it still wasn’t strong enough to reject the harmless prompt. We couldn’t increase the factor here without losing even more coherence, so we needed to rethink things.
Third attempt
This is where I finally figured out the issue with all of my previous attempts. The more honest title here is probably closer to Twelfth attempt, but I’m not going to force my debugging hell onto you in this article.
I thought more about what our projection was actually doing, and I realized it was essentially the opposite of what we wanted. For prompts that would already be rejected, our weights would push them, multiplicatively, in the rejection direction. This was just going to lead to incoherence. And for the prompts we actually cared about, the ones that weren’t being rejected, our weights were having minimal effect as they were multiplying by an extremely small original rejection value.
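A toy example makes the multiplicative problem visible. This is one plausible reading of "invert the projection and scale it", written as W + k r r^T W; the exact formula I used in the failed attempts isn't shown in this post, so treat `amplify` and the numbers below as illustrative assumptions.

```python
import numpy as np

r = np.array([1.0, 0.0, 0.0])  # toy refusal direction (unit length)
W = np.eye(3)                  # toy weight matrix

def amplify(W, r, k):
    """Boost the component W writes along r by a factor of (1 + k)."""
    r = r / np.linalg.norm(r)
    return W + k * np.outer(r, r) @ W

def refusal_component(W, x):
    """How far the layer's output points along r for input x."""
    return float(r @ (W @ x))

W_amp = amplify(W, r, k=5.0)

harmful_like = np.array([2.0, 1.0, 1.0])    # already has a large refusal component
harmless_like = np.array([0.01, 1.0, 1.0])  # almost none

harmful_after = refusal_component(W_amp, harmful_like)    # 2.0 scaled by 6
harmless_after = refusal_component(W_amp, harmless_like)  # 0.01 scaled by 6
```

The prompt that was already going to be refused gets blown up sixfold, while the harmless one barely moves, because six times almost nothing is still almost nothing.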
Now, if the issue is that we don’t want multiplicative effects, the other half of the neuron comes to mind: the bias. We can add our direction vector, scaled by some factor, to the bias term to push all prompts equally towards the rejection direction. This would allow a much smoother shift with much less loss of coherence.
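In sketch form, the bias change looks like the following. The layer shapes, the scale `alpha=4.0`, and the name `add_refusal_bias` are assumptions for illustration; the real factor has to be tuned by hand on the actual model.

```python
import numpy as np

def add_refusal_bias(b, r, alpha):
    """Return a new bias shifted by alpha along the unit refusal direction."""
    r = r / np.linalg.norm(r)
    return b + alpha * r

rng = np.random.default_rng(0)
d_in, d_model = 4, 6
W = rng.normal(size=(d_model, d_in))  # toy weights, left untouched
b = np.zeros(d_model)                 # toy original bias
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

b_ref = add_refusal_bias(b, r, alpha=4.0)

# Compare the refusal component of the layer output for ten random inputs.
xs = rng.normal(size=(10, d_in))
before = (xs @ W.T + b) @ r
after = (xs @ W.T + b_ref) @ r
shift = after - before  # the same constant for every input
```

Every input is pushed along the refusal direction by exactly alpha, no matter how small its original refusal component was, which is the additive behavior the weight edits could not give us.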
All of this sounds great in theory, but it’s essentially meaningless unless it works. The good news is that I finally had a model that actually did produce coherent refusal to every question. It is this bias-modified model that you can speak with at the top of this page.
Making it all work on this website
The only remaining step was to make this available on the web. This was so much more painful than I ever could have imagined.
After lots of annoying model patching, I assembled an extremely janky (but functional!) script that would produce an ONNX model export with our bias modifications.
I (or rather Claude, because this was about when I began to run out of steam) wired this into my blog post here using onnxruntime.
Bibliography
- [1] “abliteration.” Accessed: June 07, 2026. [Online]. Available: https://en.wiktionary.org/wiki/abliteration
- [2] A. Arditi et al., “Refusal in Language Models Is Mediated by a Single Direction,” arXiv, June 2024, [Online]. Available: https://arxiv.org/abs/2406.11717
Side note
As an aside, this is the first time in quite a few years that I’ve documented a project of mine and shared an article about it. I’m super grateful to Max Spero for suggesting that I start writing about my projects. Originally he proposed it as a way to improve my resume and portfolio, but then he hired me anyways; the idea of doing something like this stuck with me nonetheless. Overall I found this to be a super enjoyable experience, and I really deepened my understanding of this project by having to explain it.
I also had the pleasure of composing this entire document in Typst, and used the experimental HTML export feature to embed it into this page. I’m incredibly impressed by how high-quality and production-ready this feature already feels to use, and I’d highly encourage any other Typst-nerds to give it a try.