Novel LLM Technique Outperforms GPT-4 in Safety Classification; Uses Pruned Models, Fewer Examples

Lightweight Safety Classification Using Pruned Language Models

Abstract: In this paper, we introduce a novel technique for content safety and prompt injection classification for Large Language Models. Our technique, Layer Enhanced Classification (LEC), trains a Penalized Logistic Regression (PLR) classifier on the hidden state of an LLM's optimal intermediate transformer layer. By combining the computational efficiency of a streamlined PLR classifier with the sophisticated language understanding of an LLM, our approach delivers s...
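
The core idea in the abstract, a penalized logistic regression trained on one layer's hidden state, can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: the penalty is L2, the "optimal" layer is the one with the best held-out accuracy, and small synthetic vectors stand in for real LLM hidden states. All function names (`train_plr`, `pick_layer`, `fake_hidden_states`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_plr(X, y, l2=0.1, lr=0.1, epochs=300):
    """Fit an L2-penalized logistic regression by batch gradient descent.

    Assumption: the abstract says only "penalized"; L2 is chosen here.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        # Mean log-loss gradient plus the L2 penalty term on the weights.
        w = [wj - lr * (gwj / n + l2 * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, X, y):
    hits = 0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        hits += int((p >= 0.5) == bool(yi))
    return hits / len(y)


def pick_layer(train_by_layer, y_train, val_by_layer, y_val, l2=0.1):
    """Train one classifier per layer and keep the best on validation data.

    Assumption: "optimal layer" is read as highest held-out accuracy.
    """
    best = None
    for layer, X_train in train_by_layer.items():
        w, b = train_plr(X_train, y_train, l2=l2)
        acc = accuracy(w, b, val_by_layer[layer], y_val)
        if best is None or acc > best[1]:
            best = (layer, acc, w, b)
    return best


def fake_hidden_states(labels, signal, rng, dim=4):
    # Synthetic stand-in for per-example hidden-state vectors (not real data):
    # each feature is Gaussian noise shifted by +/- signal according to the label.
    return [[rng.gauss(signal * (2 * yi - 1), 1.0) for _ in range(dim)]
            for yi in labels]


rng = random.Random(0)
y_train = [i % 2 for i in range(40)]  # 1 = unsafe, 0 = safe (toy labels)
y_val = [i % 2 for i in range(20)]

# Layer 0 carries no class signal; layer 1 separates the classes.
train_by_layer = {0: fake_hidden_states(y_train, 0.0, rng),
                  1: fake_hidden_states(y_train, 2.0, rng)}
val_by_layer = {0: fake_hidden_states(y_val, 0.0, rng),
                1: fake_hidden_states(y_val, 2.0, rng)}

layer, acc, w, b = pick_layer(train_by_layer, y_train, val_by_layer, y_val)
print(f"selected layer {layer}, validation accuracy {acc:.2f}")
```

In a real setting, the per-layer feature vectors would come from a forward pass of the LLM rather than from `fake_hidden_states`. How the hidden state is pooled across tokens and how layers are compared are details the truncated abstract does not specify.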