Our goal is to develop real-time vehicle detection and tracking schemes for fisheye traffic monitoring video that use the temporal information available in the compressed domain, without decoding the entire video. Two algorithms are proposed. The first algorithm starts with a conventional single-frame detector and adds a multi-frame information fusion stage, implemented with a multi-modal bi-directional LSTM (MM bi-LSTM) network, to improve the final detection and tracking accuracy. The second algorithm first constructs a multi-frame motion trail image and then applies a single-image multi-head detector, designed to produce the bounding boxes for an individual frame. The first scheme can be viewed as a detect-to-track design, and the second as a track-to-detect design. We tested both proposals on the ICIP2020 VIP Cup dataset in H.265 video format, applying the algorithms to the motion fields and residual images of the H.265 compressed data. Their detection and tracking performance is on par with that of their pixel-domain counterparts and matches the state-of-the-art accuracy of conventional video object detectors and trackers. If the video decoding process is not counted, their computational complexity is much lower than that of conventional pixel-domain video object detectors and trackers.
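To make the idea of a multi-frame motion trail image concrete, the following is a minimal sketch and not the construction used in this work. It assumes that the motion fields have already been extracted from the H.265 bitstream as a dense array of per-pixel motion vectors, and it fuses them with an exponentially decaying running maximum of the motion magnitude. The function name `motion_trail_image`, the `decay` parameter, and the fusion rule itself are illustrative assumptions.

```python
import numpy as np


def motion_trail_image(motion_fields, decay=0.8):
    """Fuse per-frame motion fields into a single trail image.

    motion_fields: array of shape (T, H, W, 2), oldest frame first, holding
        the (dx, dy) motion vector at every pixel. Extracting these vectors
        from the compressed stream is outside the scope of this sketch.
    decay: hypothetical fading factor in [0, 1] applied to older motion.
    """
    fields = np.asarray(motion_fields, dtype=np.float64)
    if fields.ndim != 4 or fields.shape[-1] != 2:
        raise ValueError("motion_fields must have shape (T, H, W, 2)")

    # Per-pixel motion magnitude for every frame, shape (T, H, W).
    magnitude = np.linalg.norm(fields, axis=-1)

    # Running maximum with decay: recent motion stays bright, older motion
    # fades, so a moving vehicle leaves a trail along its path.
    trail = np.zeros(magnitude.shape[1:], dtype=np.float64)
    for frame_magnitude in magnitude:
        trail = np.maximum(decay * trail, frame_magnitude)
    return trail
```

Under these assumptions, a vehicle that moves across the scene produces a streak whose brightest end marks its position in the latest frame, which is the kind of single-image input that a track-to-detect detector could consume.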
APSIPA Transactions on Signal and Information Processing Special Issue - Emerging AI Technologies for Smart Infrastructure