GitHub - p-e-w/heretic: Fully automatic censorship removal for language models
Heretic is a tool that removes censorship (aka "safety alignment") from
transformer-based language models without expensive post-training.
It combines an advanced implementation of directional ablation, also known
as "abliteration" (Arditi et al. 2024),
with a TPE-based parameter optimizer powered by Optuna.
This approach enables Heretic to work completely automatically. Heretic
finds high-quality abliteration parameters by co-minim...
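
As a rough illustration of what directional ablation does, the sketch below removes a single direction from the output of a weight matrix, following the basic weight-orthogonalization form W' = W - r rᵀ W described by Arditi et al. (2024). It is not Heretic's implementation: the function names, the toy matrix, and the `scale` parameter are hypothetical and chosen only for illustration, and plain Python lists stand in for real model tensors.

```python
import math


def normalize(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def ablate_direction(weight, direction, scale=1.0):
    """Return W - scale * r r^T W, with r the unit-normalized direction.

    `weight` is a list of rows (d_out x d_in) whose outputs live in the
    space that contains `direction`. With scale=1.0 the result can no
    longer write anything along `direction`. `scale` is a hypothetical
    knob added for this sketch, not a documented Heretic parameter.
    """
    r = normalize(direction)
    if len(r) != len(weight):
        raise ValueError("direction length must match the number of rows")
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each column of W writes along r.
    along_r = [
        sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)
    ]
    # Subtract that component from every column.
    return [
        [weight[i][j] - scale * r[i] * along_r[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example: a 2x2 matrix and a made-up direction.
toy_weight = [[1.0, 2.0], [3.0, 4.0]]
toy_direction = [3.0, 4.0]
ablated = ablate_direction(toy_weight, toy_direction)
```

After full ablation, every column of `ablated` is orthogonal to `toy_direction`, so the layer's output has no component along it. In a setup like the one described above, quantities such as which layers to modify and how strongly would be the kind of parameters left to an optimizer rather than fixed by hand.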