Regularized kernel matrix decomposition for thermal video multi-object detection and tracking

Ioannis D. Schizas; Vasileios Maroulas; Guohua Ren; Ioannis D. Schizas; Vasileios Maroulas; Guohua Ren

doi:10.3934/bdia.2018004

Big Data and Information Analytics

2018, Volume 3, Issue 2: 1-23. doi: 10.3934/bdia.2018004

Previous Article Next Article

Research article

Regularized kernel matrix decomposition for thermal video multi-object detection and tracking

1.
Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, 76019, USA
2.
Department of Mathematics, University of Tennessee at Knoxville, TN, 37996, USA

Received: 08 October 2018 Accepted: 16 October 2018 Published: 09 November 2018

This paper derives a novel algorithm for joint detection and tracking of multiple moving objects in thermal videos. The problem of determining multiple objects in a frame sequence is formulated as the task of factorizing a properly defined kernel covariance matrix into sparse factors. The support of these factors will point to the indices of the pixels that form each object. A coordinate descent approach is utilized to determine the sparse factors, and extract the object pixels. A centroid pixel is estimated for each object which is subsequently tracked via Kalman filtering. A novel interplay between the sparse kernel covariance factorization scheme along with Kalman filtering is proposed to enable joint object detection and tracking, while a divide and conquer strategy is put forth to reduce computational complexity and enable efficient tracking. Extensive numerical tests on both synthetic data and thermal video sequences demonstrate the effectiveness of the novel approach and superior tracking performance compared to existing alternatives.

Keywords:

Citation: Ioannis D. Schizas, Vasileios Maroulas, Guohua Ren. Regularized kernel matrix decomposition for thermal video multi-object detection and tracking[J]. Big Data and Information Analytics, 2018, 3(2): 1-23. doi: 10.3934/bdia.2018004

Related Papers:

[1]	Tao Wu, Yu Lei, Jiao Shi, Maoguo Gong . An evolutionary multiobjective method for low-rank and sparse matrix decomposition. Big Data and Information Analytics, 2017, 2(1): 23-37. doi: 10.3934/bdia.2017006
[2]	Yaguang Huangfu, Guanqing Liang, Jiannong Cao . MatrixMap: Programming abstraction and implementation of matrix computation for big data analytics. Big Data and Information Analytics, 2016, 1(4): 349-376. doi: 10.3934/bdia.2016015
[3]	Cun Chen, Hui Peng . Dynamic mode decomposition for blindly separating mixed signals and decrypting encrypted images. Big Data and Information Analytics, 2024, 8(0): 1-25. doi: 10.3934/bdia.2024001
[4]	David E. Bernholdt, Mark R. Cianciosa, David L. Green, Kody J.H. Law, Alexander Litvinenko, Jin M. Park . Comparing theory based and higher-order reduced models for fusion simulation data. Big Data and Information Analytics, 2018, 3(2): 41-53. doi: 10.3934/bdia.2018006
[5]	Ming Yang, Dunren Che, Wen Liu, Zhao Kang, Chong Peng, Mingqing Xiao, Qiang Cheng . On identifiability of 3-tensors of multilinear rank (1; L_r; L_r). Big Data and Information Analytics, 2016, 1(4): 391-401. doi: 10.3934/bdia.2016017
[6]	Xiaoying Chen, Chong Zhang, Zonglin Shi, Weidong Xiao . Spatio-temporal Keywords Queries in HBase. Big Data and Information Analytics, 2016, 1(1): 81-91. doi: 10.3934/bdia.2016.1.81
[7]	Elnaz Delpisheh, Aijun An, Heidar Davoudi, Emad Gohari Boroujerdi . Time aware topic based recommender System. Big Data and Information Analytics, 2016, 1(2): 261-274. doi: 10.3934/bdia.2016008
[8]	Prince Peprah Osei, Ajay Jasra . Estimating option prices using multilevel particle filters. Big Data and Information Analytics, 2018, 3(2): 24-40. doi: 10.3934/bdia.2018005
[9]	Wenxue Huang, Yuanyi Pan . On Balancing between Optimal and Proportional categorical predictions. Big Data and Information Analytics, 2016, 1(1): 129-137. doi: 10.3934/bdia.2016.1.129
[10]	Yaru Cheng, Yuanjie Zheng . Frequency filtering prompt tuning for medical image semantic segmentation with missing modalities. Big Data and Information Analytics, 2024, 8(0): 109-128. doi: 10.3934/bdia.2024006

Abstract

1. Introduction

Tracking of moving objects in videos is a fundamental problem in computer vision, and a plethora of approaches have been put forth to address the tracking problem using RGB (red, green, blue) cameras, see e.g., ^{[4, 15, 19, 29]}. Nonetheless, there are many challenges that still need to be addressed such as object/camera motion, varying appearances of the objects, different illumination conditions and occlusions. Further, the presence of a changing number of multiple objects in a frame sequence makes tracking still an extremely challenging problem.

Recently, uncooled thermal sensors have become affordable and achieve improved resolution capability ^[7]. Further, there is an increasing interest in utilizing thermal sensors to facilitate vision tasks, such as face recognition, and human-robot interaction, ^{[5, 31]}. Moreover, in moving object tracking applications like outdoor surveillance, where usually the background temperature is largely different from the moving objects, thermal imaging becomes crucial in detecting and tracking those objects that radiate thermal energy such as humans, animals or vehicles. It is noteworthy that thermal imaging is not affected by shadow and light illumination, which normally is a bottleneck for RGB cameras, rendering it more suitable for moving object tracking in both daytime and nighttime ^[22]. In ^[34], a sparse representation technique is utilized to extract features for the video objects. Compressed feature vectors are first obtained by the sparse representation technique, then a Bayes binary classifier is designed to track the object. A subspace model is learned in ^[3] to model the object of interest in videos, though it is an offline tracking approach. ^[22] proposed to use a particle filter to track object motion features preprocessed from the Wigner distribution. Support vector machines and Kalman filtering are combined toward identifying and tracking pedestrians in ^[35]. In ^[36], a scheme is developed to detect the pedestrian head, and pedestrian legs which are later tracked by local search. The aforementioned approaches are limited in the sense that they cannot jointly detect and track multiple objects, while they have to impose certain pixel intensity thresholding or statistical/structural assumptions for the objects present.

Thermal cameras output corresponds usually to gray scale imaging, which results in a lower data processing complexity, in contrast to the triple data load produced by RGB cameras. Also, there are some research efforts that propose fusion of thermal and RGB visible data, e.g., ^{[6, 8, 11]}. The work in ^[6] relies on the contour saliency map, to fuse together object locations and contours from both thermal and color sensors and eventually extract the object silhouette features, thus obtaining improved tracking performance. However, the method is computationally expensive since it aims at constructing a complete object contour. In ^[11], data fusion is implemented to fuse thermal and visible data, resulting in an illumination-invariant face image. In the latter work, decision fusion combines the matching score generated from individual face recognition models. Indeed, modal fusion enables better tracking performance since more data are utilized. However, in many practical scenarios where only one of the imaging modalities is available to use, tracking systems can benefit from the utility of thermal data due to the computational cost savings introduced.

In this paper we propose a novel approach to perform joint detection and tracking of multiple moving objects in thermal videos. Having no prior information on the objects present in the video frames, the object detection problem is formulated as the problem of factorizing a pertinent kernel covariance matrix into sparse factors. The pixels consisting of an object will be determined by estimating the support of these sparse factors and employing clustering of the nonzero entries to separate individual objects. Each object will be tracked via an alternative implementation of Kalman filtering, which has been used extensively in target tracking using sensor networks ^{[16, 20, 24, 25, 26]}, and the proposed kernel matrix sparse factorization scheme. The idea of sparse covariance factorization was first explored in ^[28] to determine informative sensors in a network. However, in ^[28], linear data models are considered which is not the case in the video object tracking setting considered here. Further, the approach in ^[28] focuses in detecting stationary and static sources, whereas in the proposed work here nonlinear inter-pixel correlations are extracted and utilized along with multiple object dynamics to achieve accurate multi-object tracking. Coordinate descent techniques ^{[2, 32]}, are employed to decompose the formulated kernel covariance matrix in a recursive way. Moreover, the implementation of a computationally efficient 'divide-and-conquer' based scheme mitigates the high computational burden of factorizing large kernel covariance matrices resulting from frames having large dimensions and acquired at fast rates. The Kalman filter ^[17] is further combined with the aforementioned sparse kernel covariance factorization scheme to allow precise tracking of the detected objects in videos.

The paper is structured as follows. The problem setting is introduced in Sec. Ⅱ. The problem of clustering pixels according to the object they belong to is formulated as a sparse kernel covariance factorization problem and is detailed in Sec. Ⅲ. A computational efficient method is derived based on coordinate descent approaches. Kalman filtering is employed to track the estimated centroid pixel of the detected objects in Sec. Ⅳ. A divide and conquer implementation is described in Sec. Ⅴ, along with the interplay of Kalman filtering and our proposed sparse kernel covariance factorization algorithm. Numerical results in different scenarios in Secs. Ⅵ, Ⅶ, Ⅷ are carried out to corroborate the effectiveness of the proposed multiple object tracking scheme in thermal videos, while demonstrating the superiority of the novel approach over existing alternatives.

2. Problem setting and preliminaries

Consider a sequence of frames forming a video in which the frames contain multiple nonstationary/moving objects of interest that need to be detected/identified and tracked. Let ${\bf{F}}_t$ be the frame of the available video sequence at time instant $t$ of dimensions $f_x\times f_y$ whose entries are real numbers, i.e., ${\bf{F}}_t\in\mathbb{R}^{f_x\times f_y}$ . Further, let $x_t^{m, n}$ denote the pixel intensity of the $(m, n)$ -th pixel of frame ${\bf{F}}_t$ at time instant $t$ where $m = 1, \ldots, f_x$ and $n = 1, \ldots, f_y$ . For simplicity in exposition ${\bf{x}}_t\in\mathcal{R}^{p\times 1}$ , with $p = f_x\cdot f_y$ denotes a vector that contains all the pixels of frame ${\bf{F}}_t$ placed in there after traversing them from top to bottom and left to right. For the sake of simplicity later on we will omit the ${\bf{f}}$ index in ${{\bf{x}}}_t$ .

There is an unknown number of objects in the video that we are interested in tracking, and $M$ denotes the maximum number of objects that can be present in a frame. Let ${\cal P}_m^t$ denote the set of pixel indices corresponding to the $m$ th object at time instant $t$ , i.e., ${\cal P}_m^t: = \{[x_{m, 1, t}, y_{m, 1, t}], \ldots, [x_{m, N_m, t}, y_{m, N_m, t}]\}$ indicate the coordinates of the pixels of the $m$ th object at time instant $t$ , with $N_m$ indicating the number of pixels of object $m$ . The pixels corresponding to an object at time instant $t$ , say ${\cal P}_m^t$ , are not known. In order to model the movement of each of the objects we will focus on how the coordinates of the centroid pixel of each object evolve in time. The centroid pixel for the $m$ th object at time instant $t$ is defined as

${\bf{c}}_m^t: = \lfloor{N_m^{-1}\sum\limits_{i = 1}^{N_m}[x_{m, n, t}, y_{m, n, t}]}\rfloor,$

with $\lfloor{}\rfloor$ denoting the floor operator. The centroid pixel intuitively describes the center point of the $m$ th moving object.

When the video sequence is acquired at a high frame rate, it can be assumed that the objects' centroid pixels move from frame to frame according to a constant velocity moving model ^[1]. The velocity can be kept constant for a certain number of frames and then changed if necessary. Specifically, the $m$ -th object's state vector is denoted as ${{\bf{s}}}_m(t): = [({\bf{c}}_m^t)^T, {\bf{v}}_m^t]$ , where ${\bf{v}}_m^t$ is a $2\times 1$ vector that contains the velocity across the horizontal and vertical axis. The state vector ${{\bf{s}}}_m(t)$ is assumed to evolve according to the following model:

${\bf{s}}_{m}(t) = {\bf{A}}{\bf{s}}_{m}(t-1) + {\bf{u}}_{m}(t), \;\;m = 1, \ldots, M$

(2.1)

where ${\bf{A}}\in\mathcal{R}^{4\times 4}$ is the state transition matrix, while ${\bf{u}}_{m}(t)$ denotes zero-mean Gaussian noise with covariance ${\bf{\Sigma}}_u$ . The matrices ${\bf{A}}$ and ${\bf{\Sigma}}_u$ have the following structure (e.g., see ^[1])

$\begin{equation}\label{Eq:Transition_Matrix} {\bf{A}} = \left[\begin{array}{cccccc} 1&0&\Delta T&0\\ 0&1&0&\Delta T\\ 0&0&1&0\\ 0&0&0&1\\ \end{array} \right], \end{equation}$

(2.2)

$\begin{equation} {\bf{\Sigma}}_{u} = \sigma_u^2\left[\hspace{-0.2cm}\begin{array}{crcr} (\Delta T)^3/3\cdot {\bf{I}}_2& (\Delta T)^2/2\cdot{\bf{I}}_2\\ (\Delta T)^2/2\cdot{\bf{I}}_2&\Delta T\cdot{\bf{I}}_2\\ \end{array} \hspace{-0.2cm}\right], \end{equation}$

(2.3)

where $\Delta T$ corresponds to the inter-frame time interval, $\sigma^2_u$ is a nonnegative constant controlling the variance of the noise entries in ${\bf{u}}_{m}(t)$ , while ${\bf{I}}_2$ denotes the $2\times 2$ identity matrix. The pixel coordinates $[x_{m, 1, t}, y_{m, 1, t}]$ take integer values, however the state noise ${\bf{u}}_{m}(t)$ is assumed Gaussian causing the state to take real values. Though, we can control the state noise standard deviation such that $3\sigma_u$ ( $3-\sigma$ bounds) is equal to a small number corresponding to the number of pixels that the state is deviating from the constant velocity movement. Since the noise will take values within the interval $[-3\sigma_u, 3\sigma_u]$ with probability $99.7\%$ , the state noise can model at an acceptable level of accuracy the movement of the video objects from frame to frame despite the fact that the centroid ${\bf{c}}_m(t)$ can have real values.

2.1. Kernel-based object pixel correlations

As stated earlier the pixels corresponding to an object are unknown, thus it is essential to identify the objects before attempting to track them. To cluster the pixels of interest according to the object they belong to, we will utilize statistical correlations that pixels belonging to the same object exhibit. Pixels of an object are expected to have similar intensity (different from the background pixels) which subsequently makes them correlated.

Pixels belonging to the same object exhibit nonlinear dependencies in general in the sense of being nonlinearly correlated ^[14], thus employing a linear covariance matrix will not identify correlated components. To this end, we account for the nonlinear inter-pixel correlations of frame ${\bf{F}}_t$ , namely ${\bf{x}}_t: = \textrm{vec}({\bf{F}}_t)$ where $\textrm{vex}(.)$ is the vectorization operator, by utilizing nonlinear mappings ${\pmb{\phi}}_x({\bf{x}}_{t})$ that are applied row-wise across the entries of ${\bf{x}}_{t}$ and map each pixel in ${\bf{x}}_t$ , in a higher dimensional space where linear correlations can be exploited. Specifically, the mapping ${\pmb{\phi}}_x$ results a $f_xf_y\times D$ matrix

$\begin{equation}\label{Eq:Vec_x} {\pmb{\phi}}_x({\bf{x}}_{t}): = [{\pmb{\phi}}_x({\bf{x}}_{t}(1)), \ldots, {\pmb{\phi}}_x({\bf{x}}_{t}(f_xf_y))]^T, \end{equation}$

(2.4)

where $D$ corresponds to the dimensionality of the transformed pixel vector ${\pmb{\phi}}_x({\bf{x}}_{t}(i))$ for $i = 1, \ldots, f_xf_y$ .

The nonlinear mapping ${\pmb{\phi}}_x$ should be selected such that the covariance matrix of the transformed frames exhibits a block diagonal structure. For example Figure 1 depicts the nonzero entries (black dots) of a kernel (the Gaussian kernel will be used) covariance matrix obtained from a sequence of frames in which a single white object moves within black background. Clearly, the covariance matrix has a block diagonal structure after proper permutation of the rows and columns that contain the nonzero entries.

Figure 1. Sparse structure of a kernel covariance matrix.

	Novel object detection	Pixel thresholding
Synthetic video sequence	3	3
Thermal video sequence 1	1	2
Thermal video sequence 2	2	1

[1]	Bar-Shalom Y, (2001) EstimationWith Applications to Tracking and Navigation. New York: Wiley.
[2]	Bertsekas DP, (2003) Nonlinear Programming, Second Edition, Athena Scientific.
[3]	Black MJ, Jepson AD (1998) Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. Int J Comput Vision 26: 63–84. doi: 10.1023/A:1007939232436
[4]	Chen IK, Hsu SL, Chi CY, et al. (2014) Automatic video segmentation and object tracking with real-time RGB-D data Proc of the IEEE Intl Conf on Consum Electr: 486–487.
[5]	Cielniak G, Duckett T, Lilienthal AJ (2007) Improved data association and occlusion handling for vision-based people tracking by mobile robots. Proc of IEEE Intl Conf on Intell Robot Syst.
[6]	Davis JW, Sharma V (2007) Background-subtraction using contour-based fusion of thermal and visible imagery. Comput Vis Image Und 106: 162–182. doi: 10.1016/j.cviu.2006.06.010
[7]	Ennulat RD, Pommerrenig D (1988) Uncooled high resolution infrared imaging plane.
[8]	Hanif M, Ali U (2006) Optimized visual and thermal image fusion for e cient face recognition. Proc of IEEE Intl Conf on Inform Fusion.
[9]	Harchaoui Z, Bach F (2007) Image classification with segmentation graph kernels. Proc of IEEE Conf on Comput Vis and Pattern Recogn.
[10]	Hartigan JA, Wong MA (1979) Algorithm AS 136: A K-means clustering algorithm. J R Stat Soc C-Appl 28: 100–108.
[11]	Heo J, Kong SG, Abidi BR, et al. (2004) Fusion of visual and thermal signatures with eyeglass removal for robust face recognition. Proc of IEEE Conf on Comput Vis and Pattern Recogn Workshop, 122–122.
[12]	Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat, 1171–1220.
[13]	Horn RA, (1985) Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press.
[14]	Isola P, Zoran D, Krishnan D, et al. (2014) Crisp boundary detection using pointwise mutual information. European Conference on Computer Vision, 799–814.
[15]	Jiang M, Pan Z, Tang Z (2017) Visual object tracking based on Cross-modality Gaussian-Bernoulli deep Boltzmann machines with RGB-D sensors. Sensors 121.
[16]	Kang K, Maroulas V, Schizas I, et al. (2018) Improved distributed particle filters for tracking in wireless sensor network. Comput Stat Data An 117: 90–108. doi: 10.1016/j.csda.2017.07.009
[17]	Kay SM, (1993) Fundamental of Statistical Signal Processing: Estimation Theory, Prentice Hall.
[18]	Kwak N (2012) Kernel discriminant analysis for regression problems. Pattern Recogn 45: 2019–2031. doi: 10.1016/j.patcog.2011.11.006
[19]	Luber M, Spinello L, Arras KO (2011) People tracking in RGB-D data with on-line boosted target models. Proc of IEEE Intl Conf on Intell Robot Syst.
[20]	Maroulas V, Stinis P (2012) Improved particle filters for multi-target tracking. J Comput Phys 231: 602–611. doi: 10.1016/j.jcp.2011.09.023
[21]	IEEE OTCBVS WS Series Bench; Roland Miezianko, Terravic Research Infrared Database. Available: http://vcipl-okstate.org/pbvs/bench/
[22]	Padole CN, Alexandre LA (2010) Motion based particle filter for human tracking with thermal imaging. Proc of IEEE Intl Conf on Emerging Trends in Engineering and Technology (ICETET): 158–162.
[23]	Peng J, Zhou Y, Chen CLP (2015) Region-kernel-based support vector machines for hyperspectral image classification IEEE T Geosci Remote 53: 4810–4824.
[24]	Ren G, Maroulas V, Schizas ID (2015) Distributed Sensors-Targets Spatiotemporal Association and Tracking. IEEE Taes 51: 2570–2589.
[25]	Ren G, Maroulas V, Schizas ID (2016) Decentralized Sparsity-Based Multi-Source Association and State Tracking. Signal Process 120: 627–643. doi: 10.1016/j.sigpro.2015.10.013
[26]	Ren G, Maroulas V, Schizas ID (2016) Exploiting sensor mobility and covariance sparsity for distributed tracking of multiple targets. J Adv Sig Pr 2016: 53.
[27]	Rosipal R, Girolami M, Trejo LJ, et al. (2001) Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Comput Appl 10: 231–243. doi: 10.1007/s521-001-8051-z
[28]	Schizas ID (2013) Distributed informative-sensor identification via sparsity-aware matrix factorization. IEEE Trans on Sig Proc 61: 4610–4624. doi: 10.1109/TSP.2013.2269044
[29]	Teichman A, Lussier JT, Thrun S (2013) Learning to segment and track in RGBD. IEEE T Autom Sci Eng 10: 841–852. doi: 10.1109/TASE.2013.2264286
[30]	Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc B 58: 267–288.
[31]	Treptow A, Cielniak G, Duckett T (2005) Active people recognition using thermal and grey images on a mobile security robot. Proc of IEEE Intl Conf on Intell Robot Syst.
[32]	Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Opt Theory App 109: 475–494. doi: 10.1023/A:1017501703105
[33]	Vert JP, Tsuda K, Schölkopf B (2004) A primer on kernel methods. Kernel Methods in Comput Biol: 35–70.
[34]	Zhang K, Zhang L, Yang and M (2012) Real-time compressive tracking. European Conference on Computer Vision: 864–877.
[35]	Xu F, Liu X, Fujimura K (2005) Pedestrian detection and tracking with night vision. IEEE T Intell Transp: 63–71.
[36]	Yasuno M, Yasuda N, Aoki M (2005) Pedestrian detection and tracking in far infrared images. IEEE Conf on Intell Transport S: 131–136.
[37]	Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat: 15.

Big Data and Information Analytics

Regularized kernel matrix decomposition for thermal video multi-object detection and tracking

Related Papers:

Abstract

1. Introduction

2. Problem setting and preliminaries

2.1. Kernel-based object pixel correlations

3. Multi-object pixel clustering

3.1. Kernel covariance estimation

3.2. Kernel covariance sparse factorization

3.3. Multiple objects

4. Frame object tracking

5. Real-time object identification and tracking

5.1. Dealing with large frames

5.2. Object identification and tracking

6. Synthetic numerical tests

7. Multi-object detection and tracking in thermal video

8. racking with missing pixels

9. Concluding remarks

Acknowledgments

Supplementary

References

Reader Comments

通讯作者: 陈斌, bchen63@163.com

Metrics

Figures and Tables

Other Articles By Authors

Related pages

Tools

Export File

Citation

Format

Content

Catalog