
Your ViT is Secretly an Image Segmentation Model

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4× faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
Original language: English
Title of host publication: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: Institute of Electrical and Electronics Engineers
Pages: 25303-25313
Number of pages: 11
ISBN (electronic): 979-8-3315-4364-8
ISBN (print): 979-8-3315-4365-5
Status: Published - 13 Aug 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) - Nashville, TN, United States
Duration: 11 Jun 2025 to 15 Jun 2025

Conference

Conference: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Abbreviated title: CVPR 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 to 15/06/25
