"ResNet Matches ViT and ConvNeXt Performance on Image Recognition Tasks after Key Modifications, Disproves Claims about ConvNets Being Obsolete"

Vision Transformers are Overrated

Vision transformers (ViTs) have risen rapidly over the past four years. They have an obvious upside: in a visual recognition setting, the receptive field of a pure ViT is effectively the entire image [1]. This comes at a cost: vanilla ViTs retain the quadratic time complexity (w.r.t. the number of input patches) of language models with dense attention. Kernels in convolutional networks, on the other hand, are invariant to the input pixel/voxel they are applied to, a feature...