ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review

Abstract

This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) in the first network layer, it merges similar tokens within a small local window, and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in these two situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
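The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the code link above for that). It is a pure-Python toy that assumes a 2x2 window, a fixed cosine-similarity threshold of 0.9, averaging as the merge operation, and a simple greedy grouping for the global stage. The function names `local_merge` and `global_merge`, and all parameter values, are hypothetical. In a real network the two stages would run at different depths (first layer and halfway through), whereas here they are simply called one after the other.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def mean(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_merge(grid, window=2, threshold=0.9):
    """Stage 1 (sketch): replace a window x window block by its mean when all
    of its tokens are pairwise similar; otherwise keep the tokens unchanged.
    `grid` is a list of rows, each row a list of token vectors.
    The window size and threshold are assumed values."""
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, window):
        for j in range(0, w, window):
            block = [grid[r][c]
                     for r in range(i, min(i + window, h))
                     for c in range(j, min(j + window, w))]
            similar = all(cosine(a, b) >= threshold
                          for k, a in enumerate(block) for b in block[k + 1:])
            out.extend([mean(block)] if similar else block)
    return out

def global_merge(tokens, threshold=0.9):
    """Stage 2 (sketch): greedily group tokens from anywhere in the image.
    Each token joins the first group whose current mean it resembles.
    This greedy grouping is a stand-in chosen for brevity."""
    groups = []  # each group is a list of member tokens
    for t in tokens:
        for g in groups:
            if cosine(t, mean(g)) >= threshold:
                g.append(t)
                break
        else:
            groups.append([t])
    return [mean(g) for g in groups]

# Toy 4x4 "image": the left half is one region, the right half another.
A, B = [1.0, 0.0], [0.0, 1.0]
grid = [[A, A, B, B] for _ in range(4)]
after_local = local_merge(grid)           # 16 tokens -> 4 (one per 2x2 block)
after_global = global_merge(after_local)  # 4 tokens -> 2 (one per region)
print(len(after_local), len(after_global))  # prints: 4 2
```

In this toy, raising the threshold keeps more tokens and lowering it merges more aggressively, so a single threshold acts as a speed/accuracy dial. That is only meant to give an intuition for the inference-time adaptivity the abstract mentions, not to describe how the paper implements it.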
Original language: English
Title of host publication: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: Institute of Electrical and Electronics Engineers
Pages: 15773-15782
Number of pages: 10
ISBN (Electronic): 979-8-3503-5300-6
DOIs
Publication status: Published - 16 Sept 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 17 Jun 2024 to 21 Jun 2024

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Abbreviated title: CVPR 2024
Country/Territory: United States
City: Seattle
Period: 17/06/24 to 21/06/24

Keywords

  • Computer vision
  • Adaptation models
  • Codes
  • Adaptive systems
  • Semantic segmentation
  • Computational modeling
  • Merging
  • Token Merging
  • Efficient Vision Transformers
