News Score: Score the News, Sort the News, Rewrite the Headlines

Attention Is All You Need, but All You Can't Afford – Hybrid Attention

I have been building a small Rust-focused language model from scratch in PyTorch. This is not a finetune. It is byte-level, trained from random initialization on a Rust-heavy corpus (assembled at https://codeberg.org/JohannaJuntos/SisyphusModel). Training setup: the model has 25.6M parameters and a context length of 512. It uses a byte-level vocabulary of 256, with 8 layers, 8 heads, and 512-dimensional embeddings. Positional embeddings are learned, and the embedding and LM head weights are tied....
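As a rough sanity check on the 25.6M figure, the sketch below counts parameters for a plain GPT-style decoder with the stated sizes. It is an estimate under assumptions the excerpt does not confirm: a 4x MLP expansion, two LayerNorms per block plus a final one, and no bias terms in the linear layers. The function name `estimate_params` and the `mlp_ratio` and `bias` parameters are hypothetical, and the post's hybrid attention layers may be laid out differently.

```python
def estimate_params(vocab=256, ctx=512, d_model=512, n_layers=8,
                    mlp_ratio=4, bias=False):
    """Approximate parameter count for a standard decoder-only transformer.

    Assumed layout (not confirmed by the post): 4x MLP expansion,
    two LayerNorms per block, one final LayerNorm, tied LM head.
    """
    tok_emb = vocab * d_model                 # token embedding, shared with the LM head
    pos_emb = ctx * d_model                   # learned positional embedding
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model   # up- and down-projection
    norms = 2 * 2 * d_model                   # two LayerNorms, weight + bias each
    per_layer = attn + mlp + norms
    if bias:
        # biases for the four attention projections and the two MLP layers
        per_layer += 4 * d_model + (mlp_ratio * d_model + d_model)
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layers * per_layer + final_norm


if __name__ == "__main__":
    total = estimate_params()
    print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to 25,576,448, which rounds to the stated 25.6M. The number of heads does not appear in the formula because splitting the 512-dimensional projections into 8 heads changes their shape, not their size. Tying the LM head to the embedding is what keeps the 256 x 512 output matrix from being counted twice.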

Read more at news.ycombinator.com
