"Revamped ConvNets Challenge Vision Transformers: New ConvNeXt Models Score 87.8% ImageNet Accuracy, Outperform Hierarchical Transformers in Detection and Segmentation Tasks"

A ConvNet for the 2020s

View PDF Abstract:The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as ...