What's better: Neural nets wider with fewer layers or thinner with more layers?
Introduction
Theory
Results
Conclusion
Citations
Introduction
This post details my experiments on whether Transformers with more, thinner layers are better than Transformers with fewer, wider layers. I tested 5 different configurations and concluded that an optimal ratio between depth and width gives the best config; in my experiments, 4 layers with an embd_dim of 1024 worked the best.
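To make the depth-vs-width trade-off concrete, here is a minimal Python sketch (not my actual training code) of how such configurations can be written down and roughly sized. Only the 4-layer / embd_dim 1024 point comes from my runs; the other entries, the GPT-2-style vocabulary size, and the parameter-count formula (standard 4x FFN, biases and layer norms ignored) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layers: int            # depth: number of Transformer blocks
    embd_dim: int            # width: model / embedding dimension
    vocab_size: int = 50257  # assumed GPT-2-style vocabulary

    def approx_params(self) -> int:
        # Rough count per block: 4*d^2 for the attention projections plus
        # 8*d^2 for a standard 4x-wide FFN; ignores biases and layer norms.
        per_block = 12 * self.embd_dim ** 2
        embedding = self.vocab_size * self.embd_dim
        return self.n_layers * per_block + embedding

# Example sweep, thin-and-deep vs wide-and-shallow. Only the 4 x 1024 row
# is from my experiments; the other rows are illustrative placeholders.
configs = [
    TransformerConfig(n_layers=12, embd_dim=512),
    TransformerConfig(n_layers=8,  embd_dim=768),
    TransformerConfig(n_layers=4,  embd_dim=1024),  # best in my experiments
    TransformerConfig(n_layers=2,  embd_dim=1536),
]

for cfg in configs:
    print(f"{cfg.n_layers} layers x {cfg.embd_dim} dim -> ~{cfg.approx_params()/1e6:.0f}M params")
```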
I'm basing the layer width on "Wide Attention is the way forward for Transformers", which widens the layer through the FFN width, wh...