Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency

Tim G.W. Boers (Corresponding author), Kiki N. Fockens, Joost A. van der Putten, Tim J.M. Jaspers, Carolus H.J. Kusters, Jelmer B. Jukema, Martijn R. Jong, Maarten R. Struyvenberg, Jeroen de Groof, Jacques J. Bergman, Peter H.N. de With, Fons van der Sommen

Research output: Contribution to journal › Article › Academic › peer-review

5 Citations (Scopus)
102 Downloads (Pure)

Abstract

Pre-training deep learning models with large datasets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to training from scratch, due to the scarcity of high-quality medical imagery and labels. However, it is still unknown whether features learned on natural imagery provide an optimal starting point for downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether leveraging in-domain pre-training in gastrointestinal endoscopic image analysis has potential benefits compared to pre-training on natural images. To this end, we present a dataset comprising 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared to features learned on natural images with multiple methods and varying amounts of data and/or labels (e.g. Billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The evaluation is performed on five downstream datasets, specifically designed for a variety of gastrointestinal tasks, for example, GIANA for angiodysplasia detection and Kvasir-SEG for polyp segmentation. The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results in better-performing models than any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, utilizing self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks. Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.), where the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO scored on average 1.28% and 3.55% higher on the performance metrics, compared to the best performance found for models pre-trained on natural images. Overall, this study highlights the importance of in-domain pre-training for improving the generic nature, scalability and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are made publicly available in our repository: huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights.
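As a practical illustration of the transfer-learning setup described above, the sketch below shows how publicly released in-domain pre-trained weights could be loaded into a ResNet50 backbone and fine-tuned on a downstream endoscopy task. This is a minimal sketch, not the authors' exact pipeline: the checkpoint filename, the number of classes, and the assumption that the checkpoint stores a plain backbone state dict are all hypothetical; consult the GastroNet-5M repository (huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights) for the actual file names and formats.

```python
# Minimal sketch: fine-tuning a ResNet50 initialized with in-domain pre-trained weights.
# Filename and checkpoint layout below are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 2  # e.g., a hypothetical binary lesion-classification task

model = resnet50(weights=None)  # random initialization; backbone weights loaded next
state_dict = torch.load("gastronet5m_dino_resnet50.pth", map_location="cpu")  # hypothetical file
model.load_state_dict(state_dict, strict=False)  # load backbone; ignore any missing/extra keys
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new task-specific classification head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Standard supervised fine-tuning loop on a downstream dataset (dataloader not shown):
# for images, labels in dataloader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

The same pattern applies to a Vision-Transformer-small backbone, with the classification head replaced accordingly.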

Original language: English
Article number: 103298
Number of pages: 11
Journal: Medical Image Analysis
Volume: 98
DOIs
Publication status: Published - Dec 2024

Keywords

  • Endoscopic image analysis
  • GastroNet-5M
  • In-domain pre-training
  • Transfer learning
  • Humans
  • Image Processing, Computer-Assisted/methods
  • Endoscopy, Gastrointestinal
  • Image Interpretation, Computer-Assisted/methods
  • Deep Learning
