Developer builds 25.6M parameter Rust language model from scratch using hybrid attention architecture, achieving 51x inference speedup to 286.6 tokens/sec on RTX 4060 Ti by combining local windowed attention with GRU-like recurrent state and 8-bit KV cache compression.

Attention Is All You Need, but All You Can't Afford – Hybrid Attention

I have been building a small Rust focused language model from scratch in PyTorch. This is not a finetune. It is byte level, trained from random initialization on a Rust heavy corpus assembled here: https://codeberg.org/JohannaJuntos/SisyphusModel and training setupThe model has 25.6M parameters with a 512 context length. It uses a byte level vocabulary of 256, with 8 layers, 8 heads, and 512 dimensional embeddings. Positional embeddings are learned and the embedding and LM head weights are tied....