RC-DETR: Improving DETRs in crowded pedestrian detection via rank-based contrastive learning

Feng Gao; Jiaxu Leng; Ji Gan; Xinbo Gao

doi:10.1016/j.neunet.2024.106911

Neural Netw. 2024 Nov 25:182:106911. doi: 10.1016/j.neunet.2024.106911. Online ahead of print.

Authors

Feng Gao¹, Jiaxu Leng², Ji Gan¹, Xinbo Gao³

Affiliations

¹ Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
² Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China. Electronic address: lengjx@cqupt.edu.cn.
³ Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China. Electronic address: gaoxb@cqupt.edu.cn.

PMID: 39612687
DOI: 10.1016/j.neunet.2024.106911

Abstract

Variants of the DEtection TRansformer (DETRs) have achieved impressive performance in general object detection. However, their performance degrades notably in crowded pedestrian detection. The degradation originates mainly in the training phase, where DETRs are constrained solely by pedestrian labels. As a result, they produce image features that do not distinguish pedestrians from visually similar background elements, which leads to incorrect detections. To address this issue, this paper introduces a rank-based contrastive learning method that constructs an additional, specific constraint for each indistinguishable training sample so that the resulting image features become distinguishable. Unlike previous methods that rely solely on pedestrian labels to achieve a consistent confidence score, our approach relies on multiple constraints and aims to ensure the correct rank of detection results, with the confidence scores of pedestrians consistently surpassing those of background elements. Specifically, we first filter out training samples that could interfere with the delineation of indistinguishable and distinguishable training samples. Then, based on the confidence score rank, we divide the remaining training samples into four groups: distinguishable positive, distinguishable negative, indistinguishable positive, and indistinguishable negative. Finally, we combine these training samples into multiple positive and negative pairs and use the pairs to train DETRs via contrastive learning. Our method can be plugged into any DETR and adds no inference overhead. Extensive experiments on three DETRs show that our method achieves superior performance. In particular, on the CrowdHuman dataset, our method achieves a state-of-the-art 38.9% MR.
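
The rank-based grouping and pairing described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rules, so everything below is an assumption made for illustration: the `Sample` type, the function names, the rule that a positive sample counts as indistinguishable when it does not outscore the highest-scoring negative (and symmetrically for negatives), and the margin-based pairwise ranking loss, which stands in for the paper's contrastive objective. The initial filtering step is omitted because its criterion is not stated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical training sample: one detection query and its label."""
    score: float         # predicted confidence score
    is_pedestrian: bool  # True if matched to a pedestrian label


def split_by_rank(samples: List[Sample]) -> Dict[str, List[float]]:
    """Group scores by rank (assumed rule, not the paper's exact one).

    A positive is 'indistinguishable' if it does not outscore every negative;
    a negative is 'indistinguishable' if it is not outscored by every positive.
    """
    pos = [s.score for s in samples if s.is_pedestrian]
    neg = [s.score for s in samples if not s.is_pedestrian]
    if not pos or not neg:
        # With only one class present there is no ranking conflict.
        return {"dist_pos": pos, "ind_pos": [], "dist_neg": neg, "ind_neg": []}
    max_neg, min_pos = max(neg), min(pos)
    return {
        "dist_pos": [p for p in pos if p > max_neg],
        "ind_pos": [p for p in pos if p <= max_neg],
        "dist_neg": [n for n in neg if n < min_pos],
        "ind_neg": [n for n in neg if n >= min_pos],
    }


def pairwise_rank_loss(groups: Dict[str, List[float]], margin: float = 0.1) -> float:
    """Mean hinge penalty over (positive, negative) pairs that involve at
    least one indistinguishable sample. The margin value is an assumption."""
    all_neg = groups["dist_neg"] + groups["ind_neg"]
    pairs = [(p, n) for p in groups["ind_pos"] for n in all_neg]
    pairs += [(p, n) for p in groups["dist_pos"] for n in groups["ind_neg"]]
    if not pairs:
        return 0.0
    # Penalize a pair unless the positive outscores the negative by `margin`.
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)
```

In this sketch the loss is zero once every pedestrian outscores every background sample by at least the margin, which mirrors the stated goal of a correct confidence rank. Because the extra term would apply only during training, it is consistent with the claim that the method adds no inference overhead.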

Keywords: Contrastive learning; Crowded pedestrian detection; DEtection TRansformer (DETR).