Cervical OCT image classification using contrastive masked autoencoders with Swin Transformer

Qingbin Wang; Yuxuan Xiong; Hanfeng Zhu; Xuefeng Mu; Yan Zhang; Yutao Ma

doi:10.1016/j.compmedimag.2024.102469

Comput Med Imaging Graph. 2024 Nov 19:118:102469. doi: 10.1016/j.compmedimag.2024.102469. Online ahead of print.

Authors

Qingbin Wang¹, Yuxuan Xiong¹, Hanfeng Zhu², Xuefeng Mu³, Yan Zhang³, Yutao Ma⁴

Affiliations

¹ School of Computer Science, Wuhan University, Wuhan, 430072, China.
² School of Computer Science & Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, 430079, China.
³ Department of Obstetrics and Gynecology, Renmin Hospital of Wuhan University, Wuhan, 430060, China.
⁴ School of Computer Science & Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, 430079, China. Electronic address: ytma@ccnu.edu.cn.

PMID: 39577206
DOI: 10.1016/j.compmedimag.2024.102469

Abstract

Background and objective: Cervical cancer poses a major health threat to women globally. Optical coherence tomography (OCT) imaging has recently shown promise for non-invasive cervical lesion diagnosis. However, obtaining high-quality labeled cervical OCT images is challenging and time-consuming because each image must correspond precisely with pathological results. The scarcity of such labeled data hinders the application of supervised deep-learning models in practical clinical settings. This study addresses this challenge by proposing CMSwin, a novel self-supervised learning (SSL) framework that combines masked image modeling (MIM) with contrastive learning, based on the Swin Transformer architecture, to utilize abundant unlabeled cervical OCT images.

Methods: In this contrastive-MIM framework, mixed image encoding is combined with a latent contextual regressor. This combination resolves the inconsistency between pre-training and fine-tuning and separates the encoder's feature extraction task from the decoder's reconstruction task, allowing the encoder to extract better image representations. In addition, contrastive losses at the patch and image levels are designed to leverage massive unlabeled data.
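As a rough illustration of how a masked-reconstruction term and patch-level and image-level contrastive terms might be combined into one pre-training objective, consider the NumPy sketch below. It is not the authors' implementation: the Swin Transformer encoder, the mixed image encoding, and the latent contextual regressor are not modeled, and the function names, the InfoNCE form of the contrastive loss, the temperature of 0.2, and the loss weights are all assumptions made for the example.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector that is True for patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over masked patches only.

    pred, target: arrays of shape (num_patches, patch_dim).
    """
    squared_error = (pred - target) ** 2
    return float(squared_error[mask].mean())


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE loss: keys[i] is the positive for queries[i]; other rows are negatives.

    The temperature value is an assumption, not taken from the paper.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def combined_pretraining_loss(pred, target, mask,
                              patch_q, patch_k, image_q, image_k,
                              w_patch=1.0, w_image=1.0):
    """Reconstruction loss plus weighted patch- and image-level contrastive losses.

    The weights w_patch and w_image are hypothetical hyperparameters.
    """
    return (masked_reconstruction_loss(pred, target, mask)
            + w_patch * info_nce(patch_q, patch_k)
            + w_image * info_nce(image_q, image_k))
```

In this sketch, `patch_q` and `patch_k` would hold embeddings of corresponding patches from two views of the same image, and `image_q` and `image_k` would hold pooled embeddings of the two views for each image in a batch.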

Results: We validated the superiority of CMSwin over state-of-the-art SSL approaches with five-fold cross-validation on an OCT image dataset containing 1,452 patients from a multi-center clinical study in China, plus two external validation sets from top-ranked Chinese hospitals: the Huaxi dataset from the West China Hospital of Sichuan University and the Xiangya dataset from the Xiangya Second Hospital of Central South University. A human-machine comparison experiment on the Huaxi and Xiangya datasets for volume-level binary classification also indicates that CMSwin can match or exceed the average level of four skilled medical experts, especially in identifying high-risk cervical lesions.
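The abstract does not say how the five folds were formed. One common precaution with per-patient imaging data is to split by patient, so that images from the same patient never appear in both the training and validation sets. The sketch below shows that idea under this assumption; the function name `patient_level_folds` and the seed are hypothetical.

```python
import numpy as np


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) index arrays with no patient shared across the two.

    patient_ids: one entry per image, naming the patient the image came from.
    """
    patient_ids = np.asarray(patient_ids)
    unique_patients = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_patients)
    # Partition the patients (not the images) into n_folds groups.
    for val_patients in np.array_split(shuffled, n_folds):
        is_val = np.isin(patient_ids, val_patients)
        yield np.flatnonzero(~is_val), np.flatnonzero(is_val)
```

Each image index lands in exactly one validation fold, and fold sizes differ by at most one patient.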

Conclusion: Our work has great potential to assist gynecologists in interpreting cervical OCT images in clinical settings. Additionally, the integrated GradCAM module of CMSwin enables cervical lesion visualization, giving gynecologists interpretable evidence to help them diagnose cervical diseases efficiently.

Keywords: Cervical cancer; Image classification; Interpretability; Optical coherence tomography; Self-supervised learning; Swin Transformer.