GitHub releases Heretic - tool automatically removes safety alignment from AI language models using directional ablation; decensors models in 45 minutes without human expertise

GitHub - p-e-w/heretic: Fully automatic censorship removal for language models

Heretic: Fully automatic censorship removal for language models Heretic is a tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024), with a TPE-based parameter optimizer powered by Optuna. This approach enables Heretic to work completely automatically. Heretic finds high-quality abliteration parameters by co-minim...