Citation: Ioannis D. Schizas, Vasileios Maroulas, Guohua Ren. Regularized kernel matrix decomposition for thermal video multi-object detection and tracking[J]. Big Data and Information Analytics, 2018, 3(2): 1-23. doi: 10.3934/bdia.2018004
Tracking moving objects in videos is a fundamental problem in computer vision, and a plethora of approaches have been put forth to address it using RGB (red, green, blue) cameras; see e.g., [4, 15, 19, 29]. Nonetheless, many challenges remain, such as object/camera motion, varying object appearance, changing illumination conditions and occlusions. Further, the presence of a varying number of multiple objects in a frame sequence still makes tracking an extremely challenging problem.
Recently, uncooled thermal sensors have become affordable while achieving improved resolution [7]. There is growing interest in utilizing thermal sensors to facilitate vision tasks such as face recognition and human-robot interaction [5, 31]. Moreover, in moving object tracking applications like outdoor surveillance, where the background temperature usually differs substantially from that of the moving objects, thermal imaging becomes crucial in detecting and tracking objects that radiate thermal energy, such as humans, animals or vehicles. Notably, thermal imaging is not affected by shadows and illumination changes, which are typical bottlenecks for RGB cameras, rendering it suitable for moving object tracking in both daytime and nighttime [22]. In [34], a sparse representation technique is utilized to extract features for the video objects: compressed feature vectors are first obtained via sparse representation, and a binary Bayes classifier is then designed to track the object. A subspace model is learned in [3] to model the object of interest in videos, though it is an offline tracking approach. In [22], a particle filter is used to track object motion features preprocessed via the Wigner distribution. Support vector machines and Kalman filtering are combined to identify and track pedestrians in [35]. In [36], a scheme is developed to detect the pedestrian head and legs, which are later tracked by local search. The aforementioned approaches are limited in the sense that they cannot jointly detect and track multiple objects, while they have to impose pixel intensity thresholding or statistical/structural assumptions on the objects present.
The output of thermal cameras usually corresponds to grayscale imaging, which results in lower data processing complexity compared with the triple data load produced by RGB cameras. There are also research efforts proposing fusion of thermal and visible RGB data, e.g., [6, 8, 11]. The work in [6] relies on a contour saliency map to fuse object locations and contours from both thermal and color sensors and eventually extract object silhouette features, thus obtaining improved tracking performance. However, the method is computationally expensive since it aims at constructing a complete object contour. In [11], thermal and visible data are fused to obtain an illumination-invariant face image, and decision fusion then combines the matching scores generated by individual face recognition models. Indeed, modal fusion enables better tracking performance since more data are utilized. However, in many practical scenarios where only one of the imaging modalities is available, tracking systems can still benefit from thermal data due to the computational cost savings it introduces.
In this paper we propose a novel approach to perform joint detection and tracking of multiple moving objects in thermal videos. Having no prior information on the objects present in the video frames, the object detection problem is formulated as the problem of factorizing a pertinent kernel covariance matrix into sparse factors. The pixels constituting an object are determined by estimating the support of these sparse factors and clustering the nonzero entries to separate individual objects. Each object is then tracked by combining Kalman filtering, which has been used extensively in target tracking with sensor networks [16, 20, 24, 25, 26], with the proposed kernel matrix sparse factorization scheme. The idea of sparse covariance factorization was first explored in [28] to determine informative sensors in a network. However, [28] considers linear data models, which is not the case in the video object tracking setting considered here. Further, the approach in [28] focuses on detecting static sources, whereas the proposed work extracts nonlinear inter-pixel correlations and utilizes them along with multiple object dynamics to achieve accurate multi-object tracking. Coordinate descent techniques [2, 32] are employed to decompose the formulated kernel covariance matrix in a recursive way. Moreover, a computationally efficient 'divide-and-conquer' scheme mitigates the high computational burden of factorizing the large kernel covariance matrices that arise from high-resolution frames acquired at fast rates. The Kalman filter [17] is further combined with the aforementioned sparse kernel covariance factorization scheme to allow precise tracking of the detected objects in videos.
The paper is structured as follows. The problem setting is introduced in Sec. Ⅱ. The problem of clustering pixels according to the object they belong to is formulated as a sparse kernel covariance factorization problem in Sec. Ⅲ, where a computationally efficient method is derived based on coordinate descent. Kalman filtering is employed to track the estimated centroid pixel of the detected objects in Sec. Ⅳ. A divide-and-conquer implementation is described in Sec. Ⅴ, along with the interplay between Kalman filtering and the proposed sparse kernel covariance factorization algorithm. Numerical results for different scenarios are presented in Secs. Ⅵ, Ⅶ, Ⅷ to corroborate the effectiveness of the proposed multiple object tracking scheme in thermal videos and to demonstrate its superiority over existing alternatives.
Consider a sequence of frames forming a video in which the frames contain multiple nonstationary/moving objects of interest that need to be detected/identified and tracked. Let $F_t$ be the frame of the available video sequence at time instant $t$, of dimensions $f_x \times f_y$ and with real-valued entries, i.e., $F_t \in \mathbb{R}^{f_x \times f_y}$. Further, let $x_{m,n}^t$ denote the intensity of the $(m,n)$-th pixel of frame $F_t$ at time instant $t$, where $m=1,\ldots,f_x$ and $n=1,\ldots,f_y$. For simplicity in exposition, $\mathbf{x}_t \in \mathbb{R}^{p \times 1}$, with $p = f_x \cdot f_y$, denotes a vector containing all the pixels of frame $F_t$, placed there after traversing the frame from top to bottom and left to right. For the sake of simplicity, later on we will omit the $f$ index in $\mathbf{x}_t$.
There is an unknown number of objects in the video that we are interested in tracking, and $M$ denotes the maximum number of objects that can be present in a frame. Let $\mathcal{P}_t^m$ denote the set of pixel coordinates corresponding to the $m$th object at time instant $t$, i.e., $\mathcal{P}_t^m := \{[x_{m,1,t}, y_{m,1,t}], \ldots, [x_{m,N_m,t}, y_{m,N_m,t}]\}$, with $N_m$ indicating the number of pixels of object $m$. The pixels corresponding to an object at time instant $t$, i.e., $\mathcal{P}_t^m$, are not known. In order to model the movement of each object, we focus on how the coordinates of the centroid pixel of each object evolve in time. The centroid pixel of the $m$th object at time instant $t$ is defined as
$\mathbf{c}_t^m := \left\lfloor N_m^{-1} \sum_{i=1}^{N_m} [x_{m,i,t}, y_{m,i,t}] \right\rfloor$,
with $\lfloor \cdot \rfloor$ denoting the entry-wise floor operator. The centroid pixel intuitively describes the center point of the $m$th moving object.
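For illustration, the centroid computation amounts to flooring the coordinate-wise average of the object's pixel coordinates. A minimal sketch in Python (the coordinate array below is hypothetical):

```python
import numpy as np

# Hypothetical pixel coordinates [x, y] of one object at time t.
pixel_coords = np.array([[12, 40], [12, 41], [13, 40], [13, 41], [14, 42]])

# Centroid pixel: floor of the coordinate-wise average, per the definition above.
centroid = np.floor(pixel_coords.mean(axis=0)).astype(int)  # -> array([12, 40])
```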
When the video sequence is acquired at a high frame rate, it can be assumed that the objects' centroid pixels move from frame to frame according to a constant velocity model [1]. The velocity can be kept constant for a certain number of frames and then changed if necessary. Specifically, the $m$th object's state vector is denoted as $\mathbf{s}_m(t) := [(\mathbf{c}_t^m)^T, (\mathbf{v}_t^m)^T]^T$, where $\mathbf{v}_t^m$ is a $2 \times 1$ vector containing the velocity along the horizontal and vertical axes. The state vector $\mathbf{s}_m(t)$ is assumed to evolve according to the model
$\mathbf{s}_m(t) = \mathbf{A}\mathbf{s}_m(t-1) + \mathbf{u}_m(t), \quad m = 1, \ldots, M$ | (2.1) |
where $\mathbf{A} \in \mathbb{R}^{4 \times 4}$ is the state transition matrix, while $\mathbf{u}_m(t)$ denotes zero-mean Gaussian noise with covariance $\boldsymbol{\Sigma}_u$. The matrices $\mathbf{A}$ and $\boldsymbol{\Sigma}_u$ have the following structure (e.g., see [1])
$\mathbf{A} = \begin{bmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$, | (2.2) |
$\boldsymbol{\Sigma}_u = \sigma_u^2 \begin{bmatrix} (\Delta T)^3/3 \cdot \mathbf{I}_2 & (\Delta T)^2/2 \cdot \mathbf{I}_2 \\ (\Delta T)^2/2 \cdot \mathbf{I}_2 & \Delta T \cdot \mathbf{I}_2 \end{bmatrix}$, | (2.3) |
where $\Delta T$ corresponds to the inter-frame time interval, $\sigma_u^2$ is a nonnegative constant controlling the variance of the noise entries in $\mathbf{u}_m(t)$, and $\mathbf{I}_2$ denotes the $2 \times 2$ identity matrix. The pixel coordinates $[x_{m,i,t}, y_{m,i,t}]$ take integer values, whereas the state noise $\mathbf{u}_m(t)$ is assumed Gaussian, causing the state to take real values. However, we can control the state noise standard deviation such that $3\sigma_u$ (the $3\sigma$ bound) equals a small number of pixels by which the state may deviate from the constant velocity movement. Since the noise takes values within $[-3\sigma_u, 3\sigma_u]$ with probability 99.7%, it can model the frame-to-frame movement of the video objects at an acceptable level of accuracy, despite the fact that the centroid $\mathbf{c}_t^m$ can take real values.
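As a concrete sketch of this constant velocity model, the snippet below builds $\mathbf{A}$ and $\boldsymbol{\Sigma}_u$ per (2.2)-(2.3) and propagates a hypothetical state through (2.1); the values of $\Delta T$, $\sigma_u$ and the initial state are illustrative, not from the paper:

```python
import numpy as np

def constant_velocity_model(dT: float, sigma_u: float):
    """Build the state transition matrix A (2.2) and noise covariance Sigma_u (2.3)."""
    I2 = np.eye(2)
    A = np.block([[I2, dT * I2],
                  [np.zeros((2, 2)), I2]])
    Sigma_u = sigma_u**2 * np.block([[dT**3 / 3 * I2, dT**2 / 2 * I2],
                                     [dT**2 / 2 * I2, dT * I2]])
    return A, Sigma_u

# Propagate one object's state s = [x, y, vx, vy] across a few frames, cf. (2.1).
rng = np.random.default_rng(0)
A, Sigma_u = constant_velocity_model(dT=1.0, sigma_u=0.5)
s = np.array([50.0, 50.0, 1.0, 0.5])
for _ in range(5):
    s = A @ s + rng.multivariate_normal(np.zeros(4), Sigma_u)
```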
As stated earlier, the pixels corresponding to an object are unknown, thus it is essential to identify the objects before attempting to track them. To cluster the pixels of interest according to the object they belong to, we will utilize the statistical correlations that pixels belonging to the same object exhibit. Pixels of an object are expected to have similar intensity (different from the background pixels), which makes them correlated.
Pixels belonging to the same object generally exhibit nonlinear dependencies, in the sense of being nonlinearly correlated [14]; thus a linear covariance matrix will not identify correlated components. To this end, we account for the nonlinear inter-pixel correlations of frame $F_t$, namely $\mathbf{x}_t := \mathrm{vec}(F_t)$, where $\mathrm{vec}(\cdot)$ is the vectorization operator, by utilizing nonlinear mappings $\boldsymbol{\phi}_x(\cdot)$ applied entry-wise to $\mathbf{x}_t$ that map each pixel into a higher-dimensional space where linear correlations can be exploited. Specifically, the mapping results in an $f_x f_y \times D$ matrix
$\boldsymbol{\phi}_x(\mathbf{x}_t) := [\boldsymbol{\phi}_x(\mathbf{x}_t(1)), \ldots, \boldsymbol{\phi}_x(\mathbf{x}_t(f_x f_y))]^T$, | (2.4) |
where $D$ corresponds to the dimensionality of the transformed pixel $\boldsymbol{\phi}_x(\mathbf{x}_t(i))$ for $i = 1, \ldots, f_x f_y$.
The nonlinear mapping $\boldsymbol{\phi}_x$ should be selected such that the covariance matrix of the transformed frames exhibits a block diagonal structure. For example, Figure 1 depicts the nonzero entries (black dots) of a kernel covariance matrix (the Gaussian kernel will be used) obtained from a sequence of frames in which a single white object moves against a black background. Clearly, the covariance matrix has a block diagonal structure after proper permutation of the rows and columns containing the nonzero entries.
After properly selecting a kernel (see details later on), the covariance of the transformed data $\boldsymbol{\phi}_x(\mathbf{x}_t)$ can be written as
$E[(\boldsymbol{\phi}_x(\mathbf{x}_t) - E[\boldsymbol{\phi}_x(\mathbf{x}_t)])(\boldsymbol{\phi}_x(\mathbf{x}_t) - E[\boldsymbol{\phi}_x(\mathbf{x}_t)])^T] = \mathbf{P}_r \, \mathrm{bdiag}(\mathbf{B}_{1,t}, \mathbf{B}_{2,t}, \ldots, \mathbf{B}_{M,t}) \, \mathbf{P}_c$, | (2.5) |
where $\mathrm{bdiag}(\cdot)$ refers to a block diagonal matrix, with $\mathbf{B}_{m,t}$ denoting the $m$th diagonal block of size $N_m \times N_m$ indicating how the $N_m$ pixels of object $m$ are correlated, while pixels belonging to different objects are assumed to be uncorrelated. It is also possible that pixels from different objects are correlated if the objects have similar pixel intensity or texture. In that case one diagonal block, say $\mathbf{B}_{j,t}$, may be associated with more than one object; we will see later how such pixels can be separated. Further, the matrices $\mathbf{P}_r$ and $\mathbf{P}_c$ correspond to an arbitrary unknown permutation of the rows and columns.
The first challenge is to locate the pixels of each object. This pertains to identifying where the entries of each of the $M$ diagonal blocks are located in the transformed covariance matrix in (2.5), which boils down to estimating the size $N_m$ of each diagonal block, as well as the indices of the pixels belonging to the $m$th diagonal block. In the following section we formulate this as a sparse matrix factorization problem, while properly selecting the kernel to induce a covariance matrix with block diagonal structure. Then, once the pixels of an object have been determined, we proceed with tracking the estimated centroid of each object.
In order to estimate the covariance matrix of the transformed data in (2.5), we rely on sample averaging: after applying the transformation to the pixel vectors to obtain $\boldsymbol{\phi}_x(\mathbf{x}_t)$, we estimate the covariance of the transformed data as
$\hat{\boldsymbol{\Sigma}}_{\phi_x,t} = \frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} (\boldsymbol{\phi}_x(\mathbf{x}_\tau) - \bar{\boldsymbol{\phi}}_{x,t})(\boldsymbol{\phi}_x(\mathbf{x}_\tau) - \bar{\boldsymbol{\phi}}_{x,t})^T = \frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} [\boldsymbol{\phi}_x(\mathbf{x}_\tau)\boldsymbol{\phi}_x^T(\mathbf{x}_\tau) - \boldsymbol{\phi}_x(\mathbf{x}_\tau)\bar{\boldsymbol{\phi}}_{x,t}^T - \bar{\boldsymbol{\phi}}_{x,t}\boldsymbol{\phi}_x^T(\mathbf{x}_\tau) + \bar{\boldsymbol{\phi}}_{x,t}\bar{\boldsymbol{\phi}}_{x,t}^T]$, | (3.1) |
where $\bar{\boldsymbol{\phi}}_{x,t} := F^{-1}\sum_{\tau=(t-1)F+1}^{tF} \boldsymbol{\phi}_x(\mathbf{x}_\tau)$ corresponds to the sample-average estimate of the mean of the transformed frame pixels, while $F$ denotes the number of frames (not to be confused with $F_t$, the frame at time $t$) during which the objects are virtually stationary and occupy the same area in the frames. The higher the sampling rate, the larger $F$ can be chosen while assuming the objects are stationary within the time interval $[(t-1)F+1, tF]$.
Applying the kernel trick (assuming a proper nonlinear mapping $\boldsymbol{\phi}_x(\cdot)$ is used; see details in [12, 18, 27, 33]), the inner products involved in calculating the entries of $\boldsymbol{\phi}_x(\mathbf{x}_\tau)\boldsymbol{\phi}_x^T(\mathbf{x}_\tau)$ can be found using a proper positive definite kernel function $K(\mathbf{x}_1(i), \mathbf{x}_2(j))$, whose two arguments correspond to pixels $i$ and $j$ from frame pixel vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively; i.e., the kernel trick implies that the inner product $\langle \boldsymbol{\phi}_x(\mathbf{x}_1(i)), \boldsymbol{\phi}_x(\mathbf{x}_2(j)) \rangle$ can be evaluated via a proper scalar kernel function $K(\mathbf{x}_1(i), \mathbf{x}_2(j))$. Utilizing this property, the first term in Eq (3.1) can be rewritten as
$\frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} \boldsymbol{\phi}_x(\mathbf{x}_\tau)\boldsymbol{\phi}_x^T(\mathbf{x}_\tau) = \frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} \mathbf{K}(\mathbf{x}_\tau, \mathbf{x}_\tau)$, | (3.2) |
where $\mathbf{K}$ is an $f_x f_y \times f_x f_y$ matrix whose $(i,j)$-th entry is given by $[\mathbf{K}]_{i,j} = K(\mathbf{x}_\tau(i), \mathbf{x}_\tau(j))$. Similarly, the second and third summation terms in Eq (3.1) can be expressed as
$\frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} \boldsymbol{\phi}_x(\mathbf{x}_\tau)\bar{\boldsymbol{\phi}}_{x,t}^T = \frac{1}{F^2}\sum_{\tau=(t-1)F+1}^{tF}\sum_{\tau'=(t-1)F+1}^{tF} \boldsymbol{\phi}_x(\mathbf{x}_\tau)\boldsymbol{\phi}_x^T(\mathbf{x}_{\tau'}) = \frac{1}{F^2}\sum_{\tau,\tau'=(t-1)F+1}^{tF} \mathbf{K}(\mathbf{x}_\tau, \mathbf{x}_{\tau'})$, | (3.3) |
while the fourth summation term in Eq (3.1) gives
$\frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} \bar{\boldsymbol{\phi}}_{x,t}\bar{\boldsymbol{\phi}}_{x,t}^T = \frac{1}{F^2}\sum_{\tau',\tau''=(t-1)F+1}^{tF} \mathbf{K}(\mathbf{x}_{\tau'}, \mathbf{x}_{\tau''})$. | (3.4) |
Notice that the matrix in Eq (3.4) is the same as the one obtained in (3.3); thus the covariance matrix of the transformed data can be calculated with the aid of the kernel function $K(\cdot,\cdot)$ as
$\boldsymbol{\Sigma}_{\phi_x,t} = \frac{1}{F}\sum_{\tau=(t-1)F+1}^{tF} \mathbf{K}(\mathbf{x}_\tau, \mathbf{x}_\tau) - \frac{1}{F^2}\sum_{\tau',\tau''=(t-1)F+1}^{tF} \mathbf{K}(\mathbf{x}_{\tau'}, \mathbf{x}_{\tau''})$. | (3.5) |
A kernel that has been utilized successfully in image pixel classification [9, 23] is the Gaussian radial basis function (RBF), for which the $(i,j)$ entry of the matrix $\mathbf{K}$ used earlier can be expressed as
$K(\mathbf{x}_\tau(i), \mathbf{x}_\tau(j)) = \exp\left(-\frac{(\mathbf{x}_\tau(i) - \mathbf{x}_\tau(j))^2}{2\sigma^2}\right)$, | (3.6) |
where the variance $\sigma^2$ is a crucial parameter that controls the degree of inter-pixel correlation. Details on how to select this parameter are given in Sec. 6.
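A minimal sketch of estimating the kernel covariance (3.5) with the RBF kernel (3.6); the function names and the direct $O(F^2 p^2)$ evaluation are illustrative choices, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma2):
    """p x p matrix with entries K(x1(i), x2(j)) per the RBF kernel (3.6)."""
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2.0 * sigma2))

def kernel_covariance(frames, sigma2=0.01):
    """Kernel covariance estimate (3.5) from F vectorized frames (F x p array)."""
    F = len(frames)
    # First term: (1/F) * sum_tau K(x_tau, x_tau).
    term1 = sum(rbf_kernel(x, x, sigma2) for x in frames) / F
    # Second term: (1/F^2) * sum over all frame pairs of K(x_tau', x_tau'').
    term2 = sum(rbf_kernel(x1, x2, sigma2)
                for x1 in frames for x2 in frames) / F**2
    return term1 - term2
```

Note that for a full 100×100 frame this matrix is $10^4 \times 10^4$, which is precisely what motivates the divide-and-conquer implementation of Sec. Ⅴ.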
Next, we will determine sparse factors $\mathbf{m}_1, \ldots, \mathbf{m}_M$ such that $\hat{\boldsymbol{\Sigma}}_{\phi_x,t} \approx \sum_{m=1}^M \mathbf{m}_m\mathbf{m}_m^T = \mathbf{M}_t\mathbf{M}_t^T$, where the support of each factor $\mathbf{m}_m$ indicates the indices of the entries belonging to a block of correlated pixels in $\hat{\boldsymbol{\Sigma}}_{\phi_x,t}$, i.e., pixels that belong to the same object. The idea of utilizing sparse matrix decomposition to identify correlated data was first proposed in [28] and is generalized here under the realm of kernel-based nonlinear data transformations.
Single Object:
We start with the case where there is only one moving object in the frames of the video sequence. A standard least-squares matrix decomposition scheme would minimize the Frobenius-norm cost $\|\boldsymbol{\Sigma}_{\phi_x,t} - \mathbf{M}_t\mathbf{M}_t^T\|_F^2$ with respect to the factor estimate $\mathbf{M}_t \in \mathbb{R}^{p \times 1}$. However, such a formulation does not take into account the sparse structure of $\mathbf{M}_t$. To this end, the following minimization framework is proposed:
$\hat{\mathbf{M}}_t := \arg\min_{\mathbf{M}_t} \|\boldsymbol{\Sigma}_{\phi_x,t} - \mathbf{M}_t\mathbf{M}_t^T\|_F^2 + \lambda\|\mathbf{M}_t\|_1$, | (3.7) |
where the $\ell_1$-norm term $\|\cdot\|_1$ is utilized to induce sparsity in the column vector $\mathbf{M}_t$, see e.g., [30, 37], whose support will point to those pixels in a collection of $F$ frames that contain the object of interest within the interval $[(t-1)F+1, tF]$. The parameter $\lambda$ is the sparsity-controlling coefficient that determines the number of zeros in $\mathbf{M}_t$: the larger $\lambda$ is, the more zero entries will be contained in the optimal solution $\hat{\mathbf{M}}_t$.
The cost in Eq (3.7) is nonconvex with respect to (wrt) $\mathbf{M}_t$. To overcome this obstacle, an iterative minimization scheme is derived next using coordinate descent strategies [2]. The cost in Eq (3.7) is minimized recursively wrt one entry of $\mathbf{M}_t$, namely $M_t(j)$, while keeping all other entries of $\mathbf{M}_t$ fixed to their latest updates.
Minimization of the cost in Eq (3.7) wrt $M_t(j)$, while fixing the remaining variables to their latest updates during coordinate cycle $k$, gives the following update for $\hat{M}_t^k(j)$:
$\hat{M}_t^k(j) = \arg\min_{M_t(j)} 2\sum_{\mu=1,\mu\neq j}^{p} [\boldsymbol{\Sigma}_{\phi_x,t}(j,\mu) - M_t(j)\hat{M}_t^{k-1}(\mu)]^2 + \lambda|M_t(j)| + [\boldsymbol{\Sigma}_{\phi_x,t}(j,j) - M_t^2(j)]^2$. | (3.8) |
Discarding the terms that do not depend on $M_t(j)$ and applying proper algebraic manipulations, the cost in Eq (3.8) can be rewritten as:
$J^k(j) = (M_t(j))^4 + \lambda|M_t(j)| + (M_t(j))^2\left[2\sum_{\mu=1,\mu\neq j}^{p}[\hat{M}_t^{k-1}(\mu)]^2 - 2\delta^k(j,j)\right] - M_t(j)\left[4\sum_{\mu=1,\mu\neq j}^{p}\delta^k(j,\mu)\hat{M}_t^{k-1}(\mu)\right]$, | (3.9) |
where $\delta^k(j,\mu) := \boldsymbol{\Sigma}_{\phi_x,t}(j,\mu)$ for $j,\mu = 1,\ldots,p$. Given the most recent update $\hat{\mathbf{M}}_t^{k-1}$ from coordinate cycle $k-1$, as shown in Appendix A, the update $\hat{M}_t^k(j)$ is the value achieving the minimum cost in Eq (3.9) among the following candidates: ⅰ) 0; ⅱ) the real positive roots of the third-degree polynomial:
$4h^3 + 4\left(\sum_{\mu=1,\mu\neq j}^{p}[\hat{M}_t^{k-1}(\mu)]^2 - \delta^k(j,j)\right)h - 4\sum_{\mu=1,\mu\neq j}^{p}\delta^k(j,\mu)\hat{M}_t^{k-1}(\mu) + \lambda = 0$; | (3.10) |
ⅲ) the real negative roots of the third-degree polynomial:
$4h^3 + 4\left(\sum_{\mu=1,\mu\neq j}^{p}[\hat{M}_t^{k-1}(\mu)]^2 - \delta^k(j,j)\right)h - 4\sum_{\mu=1,\mu\neq j}^{p}\delta^k(j,\mu)\hat{M}_t^{k-1}(\mu) - \lambda = 0$. | (3.11) |
To obtain the roots of the above two third-degree polynomials, we utilize companion matrices [13]. The proposed sparsity-aware kernel matrix decomposition algorithm is tabulated as Algorithm 1. Convergence to at least a stationary point of the cost in Eq (3.7) is established in Appendix B.
Algorithm 1 Kernel Sparse Matrix Decomposition
1: Using frames within time interval $[(t-1)F+1, tF]$:
2: Form the kernel covariance matrix using Eq (3.5).
3: Initialize the $M_t(j)$'s with 0's.
4: for $k = 1, 2, \ldots, \kappa$ do
5: Evaluate $\delta^k(j,\mu)$ for $j,\mu = 1,\ldots,p$.
6: Determine the updates $\{\hat{M}_t^k(j)\}$ after determining the positive roots of Eq (3.10) and the negative roots of Eq (3.11).
7: If $\|\mathbf{M}_t^k - \mathbf{M}_t^{k-1}\| \leq \epsilon$, where $\epsilon$ is the desired error threshold, then break.
8: end for
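A sketch of the per-coordinate update, assuming the kernel covariance $\boldsymbol{\Sigma}_{\phi_x,t}$ has already been formed via (3.5); `numpy.roots` solves the cubics (3.10)-(3.11) through a companion-matrix eigendecomposition, matching the approach referenced above. Function and parameter names are illustrative:

```python
import numpy as np

def kernel_sparse_decomposition(Sigma, lam, max_iter=50, eps=1e-6):
    """Sketch of Algorithm 1: sparse rank-one factor of a kernel covariance."""
    p = Sigma.shape[0]
    M = np.zeros(p)                                   # step 3: zero initialization
    for _ in range(max_iter):
        M_prev = M.copy()
        for j in range(p):
            mask = np.arange(p) != j
            a = np.sum(M[mask] ** 2) - Sigma[j, j]    # bracketed term in (3.10)
            b = float(np.dot(Sigma[j, mask], M[mask]))
            # Candidates: 0, positive roots of (3.10), negative roots of (3.11).
            candidates = [0.0]
            for sign in (+1.0, -1.0):
                roots = np.roots([4.0, 0.0, 4.0 * a, -4.0 * b + sign * lam])
                real = roots[np.abs(roots.imag) < 1e-10].real
                candidates += [float(r) for r in real if sign * r > 0]
            def cost(h):                               # cf. (3.9)
                return h**4 + lam * abs(h) + 2.0 * a * h**2 - 4.0 * b * h
            M[j] = min(candidates, key=cost)
        if np.linalg.norm(M - M_prev) <= eps:          # step 7: stopping rule
            break
    return M  # support of M points to the object pixels
```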
After determining the sparse factor $\hat{\mathbf{M}}_t^k$, its support (the indices of its nonzero entries) will point to the moving object pixels within the frame sequence during the time interval $[(t-1)F+1, tF]$. Next, we generalize the pixel classification framework to the presence of multiple objects.
In the presence of multiple objects in a frame sequence, the sparse factorization formulation in Eq (3.7) can be used by introducing multiple columns in $\mathbf{M}_t$ and employing the same coordinate descent process described earlier. One challenge in the presence of multiple objects is the correlation among objects that have similar pixel intensities and/or texture. In this case, the sparse factorization framework may return sparse factors $\hat{\mathbf{M}}_t$ that contain nonzero values in entries corresponding to pixels of more than one object. Thus, extra clustering may be necessary to separate these pixels according to the object they correspond to. This enables splitting objects that appear in the same sparse factor so that they can be tracked individually.
To split the objects that may be present in a sparse factor returned by the sparse factorization algorithm, we rely on the property that pixels of the same object present in a sparse factor $\hat{\mathbf{M}}_t$ should be neighboring and thus closer (in terms of Euclidean distance) to each other than to pixels of a different object (located in a different part of the frame).
Let $\mathcal{P}_t$ denote the support (set of nonzero entries) of $\hat{\mathbf{M}}_t^k$, which indicates the moving objects' pixels, from which the coordinates $\mathbf{z}_i$ of each pixel $i \in \mathcal{P}_t$ can be extracted. Then, we employ K-means clustering, see [10], to partition the pixels in $\mathcal{P}_t$ into $Z_t$ clusters $\{\Xi_1, \ldots, \Xi_{Z_t}\}$ according to the similarity of their coordinates $\mathbf{z}_i$, $i \in \mathcal{P}_t$. K-means clusters the pixels by minimizing
$\arg\min_{\{\Xi_j\}} \sum_{j=1}^{Z_t} \sum_{\mathbf{z}_i \in \Xi_j} \|\mathbf{z}_i - \boldsymbol{\xi}_j\|^2$, | (3.12) |
where $\boldsymbol{\xi}_j$ corresponds to the centroid of cluster $\Xi_j$.
In this way, the pixels in $\mathcal{P}_t$ are clustered into $Z_t$ clusters, centered at $\boldsymbol{\xi}_j$, $j = 1,\ldots,Z_t$, corresponding to the different moving objects contained within a sparse factor obtained via Alg. 1. If two clusters' centroid coordinates are too close, i.e.,
$\|\boldsymbol{\xi}_j - \boldsymbol{\xi}_{j'}\|_2 \leq \epsilon_d$, | (3.13) |
where $\epsilon_d$ is a predefined distance, we decrease the initial number of clusters to $Z_t - 1$. By setting an upper limit $M_{up}$ on the number of moving objects, we can adjust the required number of clusters $Z_t$ in the aforementioned way to accurately estimate the unknown number of video objects.
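A sketch of this cluster-and-merge rule, using scikit-learn's `KMeans` as one possible K-means implementation; the function name `split_objects` and the default `eps_d` are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_objects(coords, Z_init, eps_d=5.0):
    """Cluster object-pixel coordinates into Z clusters, decreasing Z whenever
    two centroids violate the separation rule (3.13). coords: N x 2 array."""
    Z = Z_init
    while Z > 1:
        km = KMeans(n_clusters=Z, n_init=10).fit(coords)
        centers = km.cluster_centers_
        # Pairwise centroid distances; ignore the zero diagonal.
        dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        if dists.min() > eps_d:          # all centroids sufficiently separated
            return km.labels_, centers
        Z -= 1                            # too close: retry with one fewer cluster
    # Degenerate case: a single cluster containing all pixels.
    return np.zeros(len(coords), dtype=int), coords.mean(axis=0, keepdims=True)
```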
Once the pixels $\mathcal{P}_t^m$ corresponding to object $m$ have been determined, each object's centroid pixel can be computed as described earlier, and Kalman filtering is utilized to accurately track the location of each detected object within the video sequence. Recall that the state vector $\mathbf{s}_m(t)$ contains the location coordinates, as well as the velocity at which the object's centroid moves along each of the two frame dimensions. It should be emphasized that clustering the pixels according to the objects they belong to may incur some errors; thus let $\hat{\mathcal{P}}_t^m$ denote the estimated pixel locations for object $m$, and $\hat{\mathbf{c}}_t^m$ the corresponding estimate of the object's centroid pixel. The following measurement model associates $\hat{\mathbf{c}}_t^m$ with the state vector $\mathbf{s}_m(t)$:
$\hat{\mathbf{c}}_t^m = \mathbf{H}(t)\mathbf{s}_m(t) + \mathbf{w}_m(t) = \mathbf{c}_t^m + \mathbf{w}_m(t)$, | (4.1) |
with $m = 1,\ldots,M$, where
$\mathbf{H}(t) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$,
while $\mathbf{w}_m(t)$ corresponds to the localization error that may be present in $\hat{\mathbf{c}}_t^m$ when utilizing the sparse factorization approach of Sec. Ⅲ. It is assumed that the noise $\mathbf{w}_m(t)$ is zero mean with covariance $\sigma_w^2 \cdot \mathbf{I}_2$. After numerical testing, the noise standard deviation $\sigma_w$ was found to be less than 3 pixels.
Although the distribution of the noise $\mathbf{w}_m(t)$ in (4.1) is unknown and not necessarily Gaussian, the Kalman filter still provides the linear minimum mean-square error state estimate for the state and observation models in (2.1) and (4.1). The object state estimate and corresponding error covariance matrix obtained by the Kalman filter for object $m$ are denoted here as $\hat{\mathbf{s}}_m(t|t)$ and $\mathbf{P}_m(t|t)$, respectively. The prediction step of the Kalman filter, see e.g. [17], involves the following recursions for the state estimate and corresponding covariance:
$\hat{\mathbf{s}}_m(t|t-1) = \mathbf{A}\hat{\mathbf{s}}_m(t-1|t-1)$ | (4.2) |
$\mathbf{P}_m(t|t-1) = \mathbf{A}\mathbf{P}_m(t-1|t-1)\mathbf{A}^T + \boldsymbol{\Sigma}_u$. | (4.3) |
The estimated centroid $\hat{\mathbf{c}}_t^m$ is then used to carry out the correction step of the Kalman filter, which involves the following updating recursions:
$\hat{\mathbf{s}}_m(t|t) = \hat{\mathbf{s}}_m(t|t-1) + \mathbf{G}_m(t)\cdot[\hat{\mathbf{c}}_t^m - \mathbf{H}(t)\hat{\mathbf{s}}_m(t|t-1)]$, $\quad \mathbf{P}_m(t|t) = (\mathbf{I} - \mathbf{G}_m(t)\mathbf{H}(t))\mathbf{P}_m(t|t-1)$ | (4.4) |
for $m = 1,\ldots,M$, where the Kalman gain matrix $\mathbf{G}_m(t)$ is evaluated as
$\mathbf{G}_m(t) = \mathbf{P}_m(t|t-1)\mathbf{H}^T(t)(\sigma_w^2 \cdot \mathbf{I}_2 + \mathbf{H}(t)\mathbf{P}_m(t|t-1)\mathbf{H}^T(t))^{-1}$. | (4.5) |
A separate Kalman filter is implemented for each of the $M$ objects determined via Alg. 1, with each filter using the corresponding estimated centroid $\hat{\mathbf{c}}_t^m$ found at every time instant $t$.
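The per-object filter is a straightforward transcription of (4.2)-(4.5); a minimal sketch follows, where the class name and the identity initialization of the error covariance are illustrative choices:

```python
import numpy as np

class CentroidKalman:
    """Per-object Kalman filter for the models (2.1) and (4.1)."""
    def __init__(self, A, Sigma_u, sigma_w, s0):
        self.A, self.Q = A, Sigma_u
        self.H = np.array([[1., 0., 0., 0.],
                           [0., 1., 0., 0.]])          # measurement matrix H(t)
        self.R = sigma_w**2 * np.eye(2)                # localization-noise covariance
        self.s = np.asarray(s0, dtype=float)           # state [x, y, vx, vy]
        self.P = np.eye(4)                             # illustrative initialization

    def predict(self):                                 # prediction step (4.2)-(4.3)
        self.s = self.A @ self.s
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.s[:2]                              # predicted centroid

    def correct(self, c_hat):                          # correction step (4.4)-(4.5)
        S = self.H @ self.P @ self.H.T + self.R
        G = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain (4.5)
        self.s = self.s + G @ (c_hat - self.H @ self.s)
        self.P = (np.eye(4) - G @ self.H) @ self.P
        return self.s
```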
One may notice that the proposed matrix decomposition scheme may involve high-complexity computations in the initialization stage when determining $\mathbf{M}_t$, especially when the video resolution is high, leading to a large number of pixels per frame. To deal with this issue, we resort to a divide-and-conquer strategy: we split the $f_x f_y \times f_x f_y$ kernel covariance matrix into smaller parts corresponding to smaller frame regions of size $\varrho \times \varrho$, where $\varrho \ll f_x f_y$.
For each of these smaller regions we obtain $\hat{\mathbf{M}}_t^j$ for $j = 1,\ldots,J$ by applying Alg. 1 to the kernel covariance matrix $\hat{\boldsymbol{\Sigma}}_{\phi_x^j,t}$ of the subframe $\mathbf{x}_t^j$ corresponding to the $j$th region of the frame $\mathbf{x}_t$ at time instant $t$, where $J = \frac{f_x f_y}{\varrho^2}$. Then, the smaller sparse factors $\hat{\mathbf{M}}_t^j$ are stacked as
$\hat{\mathbf{M}}_t = [(\hat{\mathbf{M}}_t^1)^T, \ldots, (\hat{\mathbf{M}}_t^J)^T]^T$ | (5.1) |
to construct the sparse factor $\hat{\mathbf{M}}_t$ corresponding to the kernel covariance matrix $\hat{\boldsymbol{\Sigma}}_{\phi_x,t}$ of the entire frame $\mathbf{x}_t$. It is worth noting that $\hat{\mathbf{M}}_t$ is acquired without factorizing the much larger $f_x f_y \times f_x f_y$ kernel covariance matrix $\hat{\boldsymbol{\Sigma}}_{\phi_x,t}$. Proceeding as before, the nonzero entries of $\hat{\mathbf{M}}_t$ are utilized to estimate the object pixels $\mathcal{P}_t$. After applying K-means clustering to the pixels in $\mathcal{P}_t$, the cluster centroids serve as the initialization positions of the objects in the filtering stage. Further, the pixel subsets $\mathcal{P}_t^m$ are also used to estimate the width and height, say $w_s^0$ and $h_s^0$, of a rectangular subframe that surrounds the pixels of each moving object and provides a rectangular estimate of the object's location.
Next, we outline how the Kalman filter and the sparse kernel factorization algorithm interact to track multiple moving objects in a given video sequence. During the start-up stage, $F_s$ frames are utilized to evaluate the kernel covariance matrix. After applying Alg. 1 with the divide-and-conquer implementation of Sec. 5.1, the nonzero entries of the acquired sparse vector $\hat{\mathbf{M}}_0$ point to the pixels of the moving objects in the video.
From $\hat{\mathbf{M}}_0$, we can extract the pixels forming the detected moving objects. The dimensions $w_s^0$ and $h_s^0$ of the estimated rectangular area surrounding each object (cf. Sec. 5.1) are increased to $w_s$ and $h_s$ satisfying $w_s \geq w_s^0$ with $\mathrm{mod}(w_s, \varrho) = 0$, and $h_s \geq h_s^0$ with $\mathrm{mod}(h_s, \varrho) = 0$. This results in a slightly larger rectangular region for each object, from which smaller regions of size $\varrho \times \varrho$ can be further extracted. At the next time instant, we incorporate only the pixels around the predicted centroid position $\hat{\mathbf{s}}_m(t|t-1)$ from the Kalman filter prediction in Eq (4.2) to form the object kernel covariance matrix $\hat{\boldsymbol{\Sigma}}_{\phi_x,t,m}$ following Eq (3.1), in which $\mathbf{x}$ contains the pixels $x_{i,j}$ with x- and y-coordinates within the intervals
$[\hat{\mathbf{s}}_m(t|t-1)]_1 - w_s/2 \leq i \leq [\hat{\mathbf{s}}_m(t|t-1)]_1 + w_s/2$, $\quad [\hat{\mathbf{s}}_m(t|t-1)]_2 - h_s/2 \leq j \leq [\hat{\mathbf{s}}_m(t|t-1)]_2 + h_s/2$. | (5.2) |
Similarly to the initialization stage, the divide-and-conquer implementation is applied to the kernel covariance matrix $\hat{\boldsymbol{\Sigma}}_{\phi_x,t,m}$ of each object separately to acquire the pixels corresponding to moving object $m$ at the current time instant. This reduces the computational complexity, since each object's allocated area is further split into smaller regions.
To account for newborn or disappearing objects in the video, which result in a time-varying number of objects, we periodically generate and factorize the kernel covariance matrix of the entire frame. The complete scheme is outlined as Alg. 2.
Algorithm 2 Joint Multi-Object Detection and Tracking
1: Start-up stage ($t = 0$) / Reconfiguration ($\mathrm{mod}(t, T_c) = 0$): For $F_s$ consecutive frames, form the kernel covariance matrix according to Eq (3.5) and apply Algorithm 1 to the entire frame to determine the moving objects in the input frame sequence.
2: for $t = 1, 2, \ldots$ do
3: Gather the frames within time interval $[(t-1)F+1, tF]$.
4: Using the $F$ acquired frames, form the kernel covariance matrix $\boldsymbol{\Sigma}_{\phi_x,t,m}$ with the pixels in a rectangular region of size $w_s \times h_s$ centered at $[\hat{\mathbf{s}}_m(t|t-1)]_{1:2}$, for $m = 1,\ldots,M$.
5: Apply Algorithm 1 to $\boldsymbol{\Sigma}_{\phi_x,t,m}$ and acquire the nonzero entries of $\hat{\mathbf{M}}_t^m$.
6: Apply Kalman filtering for each of the $M$ objects, i.e., Eqs (4.2)-(4.5), to track each of the $M$ objects' centroid pixels.
7: end for
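Putting the pieces together, a hypothetical orchestration of Algorithm 2 could look as follows; all names reuse the sketches above, and the constants ($T_c$, $M_{up}$, $\sigma_w$, etc.) are illustrative, not the paper's settings:

```python
import numpy as np

def run_tracker(batches, rho, lam, Tc=20, M_up=4, sigma_w=2.0):
    """Hypothetical glue code for Algorithm 2: full-frame detection at start-up
    and every Tc batches (reconfiguration), per-object windows otherwise.

    batches: iterable of F x fx x fy arrays (F consecutive frames each).
    """
    A, Sigma_u = constant_velocity_model(dT=1.0, sigma_u=0.5)
    filters = []                                   # one CentroidKalman per object
    for t, batch in enumerate(batches):
        if t % Tc == 0:                            # start-up / reconfiguration
            _, support = divide_and_conquer_support(batch, rho, lam)
            coords = np.argwhere(support).astype(float)
            if len(coords):
                _, centers = split_objects(coords, Z_init=min(M_up, len(coords)))
                filters = [CentroidKalman(A, Sigma_u, sigma_w, np.r_[c, 0.0, 0.0])
                           for c in centers]
        for kf in filters:
            cx, cy = kf.predict()                  # window center via (5.2)
            # Form the kernel covariance of the w_s x h_s window around (cx, cy),
            # apply Algorithm 1 there, re-estimate the centroid c_hat, then call
            # kf.correct(c_hat); omitted here for brevity.
```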
The performance of the proposed scheme is first tested on a synthetic frame sequence containing three rectangular objects that move independently of each other. The video background contains randomly generated zero-mean Gaussian noise with variance 20, while the objects consist of pixels with intensity varying between 240 and 255. The synthetic video contains 60 frames, during which object 1 moves from left to right, object 2 moves from the top to the bottom of the frame, and object 3 moves from right to left. The frame size is 100 by 100, while all objects are of size 10 by 10. The true coordinates of each object's centroid pixel are recorded to evaluate the proposed tracking scheme. To select a proper kernel variance, we first empirically choose a variance range $[0.001, 1]$; by checking the kernel covariance matrices formed with different variances in this range, we select the value that yields the desired block diagonal structure in the kernel covariance matrix. The value is set to $\sigma^2 = 0.01$. Three sample images are given in Figures 2, 3, 4, with frame indices 20, 40, and 60. The objects are marked by a dark-colored circle to show how our method captures the momentary location of each rectangular moving object. The tracking root mean-squared error (RMSE), which quantifies the number of pixels by which the estimated centroid pixel coordinates 'miss' the true centroid of each object, is depicted in Figure 5. Both the proposed tracking scheme and the scheme in [34] localize the moving objects within 2 pixels of accuracy on average. However, the approach in [34] requires prior knowledge of the objects' initial coordinates, as well as a proper search window size, both of which our scheme generates in an unsupervised manner without any prior information.
Next, the proposed tracking scheme is tested on video sequences extracted from the datasets available on the OTCBVS website [21], where a Raytheon L-3 Thermal-Eye 2000AS infrared sensor is utilized to acquire 8-bit grayscale images (with a resolution of 320×240 pixels per frame). The first image sequence is extracted from OTCBVS dataset 05, i.e., the Terravic motion infrared database. In this video sequence, a man with a weapon moves from left to right while deforming. Six sample frames are presented in Figures 6, 7, 8. Even though the shape of the target varies with the non-rigid movement of the limbs, the novel algorithm detects the person and tracks the target accurately (the white box surrounding the object denotes the estimated area where the algorithm predicts there is an object). Frames 243 and 282 are zoomed in to show the accuracy of the bounding box generated by the proposed tracking scheme; see Figures 9, 10.
Another experiment is conducted on a second video sequence extracted from the OTCBVS database. In this sequence, there are two pedestrians, one moving from left to right and the other from right to left. Six sample frames are displayed in Figures 11, 12, 13. For a better view of how the proposed tracking scheme localizes both pedestrians, frames 251 and 360 are zoomed in and displayed in Figures 14, 15, 16, and 17. The tracking RMSE of the proposed method and the tracking scheme in [34] is compared in Figure 18. Our tracking scheme outperforms the scheme in [34] for both pedestrians; it is worth noting that, for our scheme, the average tracking error for most of the tracking time is below 4 pixels.
Further, we compare our novel object detection scheme with a simple approach where objects are detected by thresholding the pixel values. The results are displayed in Table 1. The thresholding approach is very sensitive to the selection of the threshold and may detect the wrong number of objects: it may miss objects whose pixel intensity is below the chosen threshold, or declare as objects background artifacts whose pixel intensity exceeds the threshold. In contrast, our scheme is not affected by background noise and is capable of finding the correct number of objects.
Table 1. Number of detected objects.
| | Novel object detection | Pixel thresholding |
| Synthetic video sequence | 3 | 3 |
| Thermal video sequence 1 | 1 | 2 |
| Thermal video sequence 2 | 2 | 1 |
Oftentimes, videos may be corrupted due to camera or storage issues, resulting in missing pixels in the video frames. Here, our tracking scheme is tested in a scenario where a portion of the frame pixels are missing (their intensity is set to 0); see Figures 19 and 20, where two sample frames (224 and 252) with 5% random pixel loss display the tracking result. Despite the missing pixels, our approach is able to track the objects. For the pedestrian on the right, despite the loss of several pixels, our tracking scheme enables precise localization. The tracking performance for the pedestrian on the left is not as good, since that pedestrian occupies a smaller number of pixels, which causes a larger portion of the object to disappear in the presence of missing pixels.
A novel algorithm for joint multi-object detection and tracking in video sequences was put forth. The task of identifying objects in a sequence of frames was transformed into a sparse kernel covariance factorization problem, where the support of the estimated sparse factors points to the pixels of each object present in a frame. To this end, a sparsity-aware kernel covariance matrix factorization scheme based on $\ell_1$ regularization was proposed and minimized via a coordinate descent approach. Once the objects are determined, Kalman filtering is implemented cooperatively with the sparse kernel covariance factorization scheme to allow accurate tracking of each object's centroid pixel. Numerical tests on different video datasets validate the effectiveness of the proposed video tracking mechanism in the presence of multiple objects, and corroborate its improved tracking performance over existing alternatives.
Work in this paper was supported by the AFOSR Grant FA9550-15-1-0103 and the NSF grant 1509780.
Appendix A: Proof of Eqs (3.10), (3.11)
Let $M_t(j) = h$, while setting the rest of the minimization variables in (3.7) to their most up-to-date values at the end of cycle $k-1$. It follows that $\hat{M}_t^k(j)$ is the minimizer of
$\arg\min_h h^4 + c_1 h^2 + c_2 h + \lambda t, \quad \text{s. to } |h| \leq t$, | (10.1) |
where
$c_1 = 2\sum_{i=1, i\neq j}^{p}[\hat{M}_t^{k-1}(i)]^2 - 2\delta^k(j,j) + \phi$, and | (10.2) |
$c_2 = -4\sum_{i=1, i\neq j}^{p}\delta^k(j,i)\hat{M}_t^{k-1}(i)$. | (10.3) |
After evaluating the derivatives of the cost in (10.1) wrt $h$ and $t$ and applying the Karush-Kuhn-Tucker optimality conditions [2], it follows that $h^* := \hat{M}_t^k(j)$ should satisfy $4(h^*)^3 + 2c_1 h^* + c_2 + \mu_1^* - \mu_2^* = 0$ and $-\mu_1^* - \mu_2^* + \lambda = 0$, where $\mu_1^*$ and $\mu_2^*$ are the optimal multipliers corresponding to the inequality constraints of (10.1). Note that $\mu_1^* \geq 0$ and $\mu_2^* \geq 0$, while the complementary slackness conditions impose that $\mu_1^*(h^* - t^*) = \mu_2^*(-t^* - h^*) = 0$. If $h^* > 0$, the slackness conditions imply that $\mu_2^* = 0$, from which it follows that $\mu_1^* = \lambda$. Substituting the latter values in $4(h^*)^3 + 2c_1 h^* + c_2 + \mu_1^* - \mu_2^* = 0$ gives (3.10). Similarly, the negative candidate minimizers of (10.1) are obtained from the roots of (3.11).
Appendix B: Convergence of Alg. 1
Let $\ell(\{M_t(j)\}_{j=1}^p)$ denote the cost in (3.7), which is defined over $\mathbb{R}^{p\times 1}$, and define
$\ell_0(\{M_t(j)\}_{j=1}^p) := \sum_{j=1}^{p}\sum_{j'=1}^{p}[\hat{\boldsymbol{\Sigma}}_{\phi_x,t}(j,j') - M_t(j)M_t(j')]^2$.
Further, consider the following level set:
$\mathcal{L}_t^0 := \{\{M_t(j)\}_{j=1}^p : \ell(\{M_t(j)\}_{j=1}^p) \leq \ell(\hat{\mathbf{M}}_t^0)\}$, | (11.1) |
where $\hat{\mathbf{M}}_t^0$ is the $p \times 1$ vector used to initialize Alg. 1, selected such that $\|\hat{\mathbf{M}}_t^0\|_1 < \infty$, from which it follows that $\ell(\hat{\mathbf{M}}_t^0) < \infty$. Then, from (11.1) and the form of $\ell(\cdot)$, it follows that the member vectors $\mathbf{M}_t$ of $\mathcal{L}_t^0$ satisfy
$\sum_{j=1}^{p}\lambda|M_t(j)| \leq \ell(\hat{\mathbf{M}}_t^0) < \infty$.
Thus, the level set $\mathcal{L}_t^0$ is closed and bounded (compact). Also, $\ell(\cdot)$ is continuous on $\mathcal{L}_t^0$.
Recall from [cf. (10.1)] that the cost involved in updating $\hat{M}_t^k(j)$ can be written as $J_t^k(j) := h^4 + c_1 h^2 + c_2 h + \lambda|h|$. If $c_2 \neq 0$, then after determining the monotonicity of $J_t^k(j)$, it has a unique minimizer. If $c_2 = 0$, then $J_t^k(j)$ is symmetric around zero; in that case, if $c_1 > 0$ the unique minimizer of $J_t^k(j)$ is 0, while if $c_1 < 0$ then $J_t^k(j)$ has two minimizers of the same magnitude but opposite signs, in which case we can consistently select the positive (or negative) minimizer, ensuring a unique minimizer per iteration. The function $\ell(\cdot)$ satisfies the regularization conditions outlined in [32, (A1)]. In detail, the domain of $\ell_0(\cdot)$ is formed by vectors whose entries satisfy $M_t(j) \in (-\infty, +\infty)$. Thus, $\mathrm{domain}(\ell_0) = (-\infty, \infty)^{p\times 1}$ is an open set. Further, $\ell_0(\cdot)$ is Gâteaux differentiable over $\mathrm{domain}(\ell_0)$. The Gâteaux derivative is
$\ell_0'(\mathbf{M}; \Delta_{\mathbf{M}}) := \lim_{\epsilon\to 0}[\ell_0(\mathbf{M} + \epsilon\Delta_{\mathbf{M}}) - \ell_0(\mathbf{M})]/\epsilon$. | (11.2) |
After carrying out the necessary algebraic operations, it follows readily that $\ell_0'(\mathbf{M}; \Delta_{\mathbf{M}})$ exists for all $\Delta_{\mathbf{M}} \in \mathrm{domain}(\ell_0)$, and it equals
$-2\,\mathrm{tr}[(\hat{\boldsymbol{\Sigma}}_{\phi_x,t} - \mathbf{M}_t\mathbf{M}_t^T)(\mathbf{M}_t\Delta_{\mathbf{M}}^T + \Delta_{\mathbf{M}}\mathbf{M}_t^T)] + \mathbf{1}^T(\mathbf{M}_t \odot \Delta_{\mathbf{M}})\mathbf{1}$.
The aforementioned properties ensure that Alg. 1 iterates converge to a stationary point of ℓ(⋅) [32, Thm. 4.1 (c)].
[1] | Bar-Shalom Y, (2001) Estimation With Applications to Tracking and Navigation. New York: Wiley. |
[2] | Bertsekas DP, (2003) Nonlinear Programming, Second Edition, Athena Scientific. |
[3] | Black MJ, Jepson AD (1998) Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. Int J Comput Vision 26: 63–84. doi: 10.1023/A:1007939232436 |
[4] | Chen IK, Hsu SL, Chi CY, et al. (2014) Automatic video segmentation and object tracking with real-time RGB-D data. Proc of the IEEE Intl Conf on Consum Electr: 486–487. |
[5] | Cielniak G, Duckett T, Lilienthal AJ (2007) Improved data association and occlusion handling for vision-based people tracking by mobile robots. Proc of IEEE Intl Conf on Intell Robot Syst. |
[6] | Davis JW, Sharma V (2007) Background-subtraction using contour-based fusion of thermal and visible imagery. Comput Vis Image Und 106: 162–182. doi: 10.1016/j.cviu.2006.06.010 |
[7] | Ennulat RD, Pommerrenig D (1988) Uncooled high resolution infrared imaging plane. |
[8] | Hanif M, Ali U (2006) Optimized visual and thermal image fusion for efficient face recognition. Proc of IEEE Intl Conf on Inform Fusion. |
[9] | Harchaoui Z, Bach F (2007) Image classification with segmentation graph kernels. Proc of IEEE Conf on Comput Vis and Pattern Recogn. |
[10] | Hartigan JA, Wong MA (1979) Algorithm AS 136: A K-means clustering algorithm. J R Stat Soc C-Appl 28: 100–108. |
[11] | Heo J, Kong SG, Abidi BR, et al. (2004) Fusion of visual and thermal signatures with eyeglass removal for robust face recognition. Proc of IEEE Conf on Comput Vis and Pattern Recogn Workshop, 122–122. |
[12] | Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat, 1171–1220. |
[13] | Horn RA, (1985) Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press. |
[14] | Isola P, Zoran D, Krishnan D, et al. (2014) Crisp boundary detection using pointwise mutual information. European Conference on Computer Vision, 799–814. |
[15] | Jiang M, Pan Z, Tang Z (2017) Visual object tracking based on Cross-modality Gaussian-Bernoulli deep Boltzmann machines with RGB-D sensors. Sensors 121. |
[16] | Kang K, Maroulas V, Schizas I, et al. (2018) Improved distributed particle filters for tracking in wireless sensor network. Comput Stat Data An 117: 90–108. doi: 10.1016/j.csda.2017.07.009 |
[17] | Kay SM, (1993) Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall. |
[18] | Kwak N (2012) Kernel discriminant analysis for regression problems. Pattern Recogn 45: 2019–2031. doi: 10.1016/j.patcog.2011.11.006 |
[19] | Luber M, Spinello L, Arras KO (2011) People tracking in RGB-D data with on-line boosted target models. Proc of IEEE Intl Conf on Intell Robot Syst. |
[20] | Maroulas V, Stinis P (2012) Improved particle filters for multi-target tracking. J Comput Phys 231: 602–611. doi: 10.1016/j.jcp.2011.09.023 |
[21] | IEEE OTCBVS WS Series Bench; Roland Miezianko, Terravic Research Infrared Database. Available: http://vcipl-okstate.org/pbvs/bench/ |
[22] | Padole CN, Alexandre LA (2010) Motion based particle filter for human tracking with thermal imaging. Proc of IEEE Intl Conf on Emerging Trends in Engineering and Technology (ICETET): 158–162. |
[23] | Peng J, Zhou Y, Chen CLP (2015) Region-kernel-based support vector machines for hyperspectral image classification. IEEE T Geosci Remote 53: 4810–4824. |
[24] | Ren G, Maroulas V, Schizas ID (2015) Distributed sensors-targets spatiotemporal association and tracking. IEEE T Aero Elec Sys 51: 2570–2589. |
[25] | Ren G, Maroulas V, Schizas ID (2016) Decentralized sparsity-based multi-source association and state tracking. Signal Process 120: 627–643. doi: 10.1016/j.sigpro.2015.10.013 |
[26] | Ren G, Maroulas V, Schizas ID (2016) Exploiting sensor mobility and covariance sparsity for distributed tracking of multiple targets. J Adv Sig Pr 2016: 53. |
[27] | Rosipal R, Girolami M, Trejo LJ, et al. (2001) Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Comput Appl 10: 231–243. doi: 10.1007/s521-001-8051-z |
[28] | Schizas ID (2013) Distributed informative-sensor identification via sparsity-aware matrix factorization. IEEE Trans on Sig Proc 61: 4610–4624. doi: 10.1109/TSP.2013.2269044 |
[29] | Teichman A, Lussier JT, Thrun S (2013) Learning to segment and track in RGBD. IEEE T Autom Sci Eng 10: 841–852. doi: 10.1109/TASE.2013.2264286 |
[30] | Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc B 58: 267–288. |
[31] | Treptow A, Cielniak G, Duckett T (2005) Active people recognition using thermal and grey images on a mobile security robot. Proc of IEEE Intl Conf on Intell Robot Syst. |
[32] | Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Opt Theory App 109: 475–494. doi: 10.1023/A:1017501703105 |
[33] | Vert JP, Tsuda K, Schölkopf B (2004) A primer on kernel methods. Kernel Methods in Comput Biol: 35–70. |
[34] | Zhang K, Zhang L, Yang M (2012) Real-time compressive tracking. European Conference on Computer Vision: 864–877. |
[35] | Xu F, Liu X, Fujimura K (2005) Pedestrian detection and tracking with night vision. IEEE T Intell Transp: 63–71. |
[36] | Yasuno M, Yasuda N, Aoki M (2005) Pedestrian detection and tracking in far infrared images. IEEE Conf on Intell Transport S: 131–136. |
[37] | Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15: 265–286. |