
I don't know if it's still comedy or has now reached the stage of farce, but I always get a good laugh when I see another article about researchers being shocked and surprised to find that training LLMs to be politically correct makes them dumber. How long until they figure out that the only solution is to know the correct answer but give the politically correct answer (which is the strategy humans use)?

Technically, why not implement alignment/debiasing as a secondary filter with its own weights, independent of the core model whose job is to model reality? I suspect it may be hard to get enough of the right kind of data to train such a filter model, and most likely you'd want the identity of the user to be part of the objective. A rough sketch of the idea is below.
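
A minimal sketch of what that could look like, assuming a frozen base model and a separately trained filter head. All names here (AlignmentFilter, the base_lm interface, the user-group embedding) are hypothetical illustrations, not any real library API:

```python
# Hedged sketch: the base model's weights stay frozen; a separately trained filter
# scores candidate responses for policy compliance and can veto or re-rank them,
# optionally conditioned on a user-identity embedding.
import torch
import torch.nn as nn

class AlignmentFilter(nn.Module):
    """Independently trained filter: scores pooled hidden states as allow vs. suppress."""
    def __init__(self, hidden_dim: int, num_user_groups: int):
        super().__init__()
        # User identity is part of the filter's objective, as suggested above.
        self.user_embed = nn.Embedding(num_user_groups, hidden_dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden: torch.Tensor, user_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) pooled representation of a candidate response
        u = self.user_embed(user_id)                          # (batch, hidden_dim)
        return self.scorer(torch.cat([hidden, u], dim=-1))    # logit: higher = acceptable

def filtered_generate(base_lm, filter_model, prompt_ids, user_id, threshold=0.0):
    """Base model proposes; the filter (separate weights) gates the output."""
    with torch.no_grad():
        hidden, response_ids = base_lm(prompt_ids)            # hypothetical base-model interface
        score = filter_model(hidden, user_id)
    if score.item() < threshold:
        return None   # suppress, or fall back to a refusal template
    return response_ids
```

The point of the design is that the reality-modeling weights and the policy weights are trained on different data against different objectives, so changing the policy never has to touch the core model.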






The reality, I suspect, is that these models are likely already representing alignment features such as refusals internally as something like a secondary filter.

In fact, for many models you can remove refusals rather trivially with linear steering vectors found via SAEs.

https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus...
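
For illustration, here is a minimal sketch of the directional-ablation idea described in work like the post above: project a single "refusal direction" out of a layer's activations. The direction here is a random placeholder, and the hook wiring is left out because it varies by framework; this is not a drop-in implementation:

```python
# Hedged sketch of directional ablation: remove the component of an activation that
# lies along a hypothesized "refusal direction". In practice the direction would be
# estimated separately (e.g. difference of mean activations on refused vs. complied
# prompts, or an SAE feature); here it is just a placeholder tensor.
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out `direction` from each activation vector: a' = a - (a . r_hat) r_hat."""
    r_hat = direction / direction.norm()                     # unit vector, shape (hidden_dim,)
    proj = (activations @ r_hat).unsqueeze(-1) * r_hat       # component along the direction
    return activations - proj

# Usage: apply inside a forward hook on a transformer block (hook API varies by framework).
hidden_dim = 4096
refusal_direction = torch.randn(hidden_dim)                  # placeholder, not a real direction
acts = torch.randn(2, 10, hidden_dim)                        # (batch, seq, hidden_dim)
steered = ablate_direction(acts, refusal_direction)
```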

Additionally, you can often jailbreak these models by fine-tuning on just a handful of curated samples.



