High-Precision Multi-Object Tracking in Satellite Videos via Pixel-Wise Adaptive Feature Enhancement

Sensors (Basel). 2024 Oct 9;24(19):6489. doi: 10.3390/s24196489.

Abstract

In this paper, we focus on the multi-object tracking (MOT) task in satellite videos. To achieve efficient and accurate tracking, we propose a transformer-distillation-based end-to-end joint detection and tracking (JDT) method. Specifically, (1) because targets in satellite videos are usually small and are captured from a bird's-eye view, we propose a pixel-wise transformer-based feature distillation module, through which useful object representations are learned via pixel-wise distillation from a strong teacher detection network; (2) because targets in satellite videos, such as airplanes, ships, and vehicles, usually have similar appearances, we propose a temperature-controllable key feature learning objective function that highlights the learning of similar features during distillation, further improving the tracking accuracy for such objects; (3) we propose a method that is based on an end-to-end network but learns simultaneously from a highly precise teacher network and a tracking head during training, so that the tracking accuracy of the end-to-end network is improved via distillation without compromising efficiency. Experimental results on three recently released, publicly available datasets demonstrate the superior performance of the proposed method for satellite videos. The proposed method achieved over 90% overall tracking performance on the AIR-MOT dataset.

Keywords: multi-object tracking (MOT); satellite video; transformer attention learning.
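
The abstract does not give the form of the temperature-controllable objective, so the following is only a minimal sketch of a generic temperature-scaled, pixel-wise distillation loss, not the authors' formulation. It assumes that each pixel carries a vector of channel activations, that teacher and student maps have the same shape, and that the per-pixel loss is a KL divergence between temperature-softened channel distributions. The function names and the default temperature are hypothetical.

```python
import math


def log_softmax(values, temperature):
    """Numerically stable log-softmax of `values` divided by `temperature`."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def pixelwise_distillation_loss(student, teacher, temperature=2.0):
    """Mean per-pixel KL(teacher || student) over softened channel distributions.

    `student` and `teacher` are lists of pixels; each pixel is a list of
    channel activations. This layout is an assumption made for illustration.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if len(student) != len(teacher) or not student:
        raise ValueError("student and teacher need the same non-zero pixel count")

    total = 0.0
    for s_pixel, t_pixel in zip(student, teacher):
        if len(s_pixel) != len(t_pixel):
            raise ValueError("channel counts differ at a pixel")
        log_t = log_softmax(t_pixel, temperature)
        log_s = log_softmax(s_pixel, temperature)
        # KL divergence between the softened teacher and student distributions.
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

    # The T^2 factor is the usual convention for keeping gradient magnitudes
    # comparable across temperatures in softened-distribution distillation.
    return (temperature ** 2) * total / len(student)


# Toy feature maps: 2 pixels, 3 channels each (made-up values).
teacher_map = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student_map = [[1.5, 0.7, -0.5], [0.0, 0.5, 2.0]]
loss = pixelwise_distillation_loss(student_map, teacher_map, temperature=2.0)
```

In this sketch, a larger temperature flattens each pixel's channel distribution, so channels with smaller activations contribute more to the loss; a smaller temperature concentrates the loss on the dominant channels.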