What's better: Neural nets wider with fewer layers or thinner with more layers?
Introduction
Theory
Results
Conclusion
Citations
Introduction
This post details my experiments on whether Transformers with more, thinner layers are better than Transformers with fewer, wider layers. I tested 5 different configurations and concluded that an optimal ratio between depth and width gives the best config; in my experiments, 4 layers with an embd_dim of 1024 worked the best.
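To make the depth-vs-width trade-off concrete, here is a minimal Python sketch (not my actual training code) of how such configurations can be written down and roughly sized. Only the 4-layer / embd_dim 1024 point comes from my runs; the other entries, the GPT-2-style vocabulary size, and the parameter-count formula (standard 4x FFN, biases and layer norms ignored) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layers: int            # depth: number of Transformer blocks
    embd_dim: int            # width: model / embedding dimension
    vocab_size: int = 50257  # assumed GPT-2-style vocabulary

    def approx_params(self) -> int:
        # Rough count per block: 4*d^2 for the attention projections plus
        # 8*d^2 for a standard 4x-wide FFN; ignores biases and layer norms.
        per_block = 12 * self.embd_dim ** 2
        embedding = self.vocab_size * self.embd_dim
        return self.n_layers * per_block + embedding

# Example sweep, thin-and-deep vs wide-and-shallow. Only the 4 x 1024 row
# is from my experiments; the other rows are illustrative placeholders.
configs = [
    TransformerConfig(n_layers=12, embd_dim=512),
    TransformerConfig(n_layers=8,  embd_dim=768),
    TransformerConfig(n_layers=4,  embd_dim=1024),  # best in my experiments
    TransformerConfig(n_layers=2,  embd_dim=1536),
]

for cfg in configs:
    print(f"{cfg.n_layers} layers x {cfg.embd_dim} dim -> ~{cfg.approx_params()/1e6:.0f}M params")
```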
I'm basing the layer width on "Wide Attention is the way forward for Transformers", which widens the layer through the FFN width, wh...