This paper studied panel interval-valued data models with individual fixed effects, in which the correlation within a group was considered and the group average method was used to eliminate the fixed effects. Then, we applied generalized estimation equations (GEEs) to analyze panel interval-valued data models and gave a computational algorithm to obtain the estimators. Some Monte Carlo simulations and real data analysis showed that, in contrast with the least-squares dummy-variable (LSDV) method, the proposed GEEs method has advantages in forecasting performance.
Citation: Chi Liu, Ruiqin Tian, Dengke Xu. Generalized estimation equations method for fixed effects panel interval-valued data models[J]. Electronic Research Archive, 2025, 33(6): 3733-3755. doi: 10.3934/era.2025166
1. Introduction
Interval-valued data are common in practice. For example, in the stock market, the K-line chart, an important decision-making tool, records the opening price, closing price, maximum, and minimum of the day. In temperature prediction, the daily minimum and maximum temperatures can be regarded as an interval. In other cases, the complexity of the system means that an observation is naturally an interval rather than a single point. For example, experts often predict the economic growth rate of the next year in the form of an interval, such as 4–5%. Interval-valued data, proposed by Diday [1], also provide a tool for dealing with large datasets. They contain richer information than point data and have better interpretability, so it is meaningful to analyze them. Billard and Diday [2] proposed the first algorithm for fitting an interval-valued linear regression, which fits the linear regression model to the midpoints of the interval values and obtains the parameters by minimizing the midpoint error. Billard and Diday [3] proposed the min-max method, which defines two models, one for each response bound: the lower bounds of the response depend on the lower bounds of the explanatory variables, and the upper bounds of the response depend on the upper bounds of the explanatory variables. Lima Neto and De Carvalho [4] introduced the center and range method (CRM), which fits two linear models: one for the midpoint and another for the range. Kong and Gao [5] proposed a regularized interval MM estimate (RIMME) for interval-valued regression, which can achieve a good balance between the prediction accuracy and the mathematical coherence of the predicted intervals. Xu and Qin [6] proposed a bivariate Bayesian regression model based on CRM with known or unknown covariance matrices. Beyaztas et al. [7] introduced functional forms of some well-known regression models for interval-valued data. Interval analysis techniques can also be found in other literature [8,9,10,11,12]. In summary, the above research on interval-valued data is mainly based on cross-sectional data.
Panel data, or longitudinal data, are obtained by observing multiple individuals repeatedly over time. Statistical models that combine cross-sectional and time-series real-valued data are becoming more and more popular in economic research, and panel datasets have certain advantages over traditional pure cross-sectional or pure time-series datasets. Galvao Jr. [13] studied a quantile regression dynamic panel model with fixed effects. Aristodemou [14] studied semiparametric identification in linear index discrete response panel data models with fixed effects. Hamiye Beyaztas and Bandyopadhyay [15] proposed a new, weighted likelihood-based robust estimation procedure for linear panel data models with fixed and random effects. Bai et al. [16] studied generalized least squares (GLS) estimation for linear panel data models. Baltagi and Li [17] focused on prediction in spatial models based on panel data. Elhorst [18] studied the specification and estimation of spatial panel data models. Zhang et al. [19] studied a penalized quantile regression for spatial panel models with fixed effects. In recent years, a few scholars have begun to study interval panel data models. For example, Ji et al. [20] introduced fixed effects panel data regression models for interval-valued data, and presented three kinds of panel interval-valued data regression models together with the estimation of their parameters. Zhang et al. [21] proposed a robust estimation method based on the iterative weighted least squares technique to reduce the impact of outliers on the models. Considering the correlation between the center and range or between the upper and lower bounds of intervals, Ji et al. [22] proposed the bivariate maximum likelihood (BML) method for estimating the unknown parameters of the model. Usually, the observations of different individuals are independent of each other, but the observations of the same individual at different times are correlated. As far as we know, current research on panel interval-valued data has not considered this within-group correlation.
Generalized estimation equations (GEEs), proposed by Liang and Zeger [23], are commonly used to analyze longitudinal data with correlation within individuals. Crowder [24] studied the (weak) consistency and inconsistency of the solutions of general estimating equations. Balan and Schiopu-Kratina [25] studied the consistency and asymptotic normality of the GEEs estimator with the covariate dimension fixed. Wang [26] studied the consistency and asymptotic normality of GEEs estimates when the covariate dimension tends to infinity with the sample size. Wang and Carey [27] showed that both the discrepancy between the working correlation structure and the true correlation structure and the method used to estimate the correlation coefficient affect the asymptotic relative efficiency. Pan [28] proposed a new model selection criterion that minimizes the expected predictive bias of estimating equations. Wang [29] presented a systematic review of GEEs, covering foundational concepts as well as several recent developments motivated by practical challenges in real applications. Seaman and Copas [30] described doubly robust (DR) GEEs and illustrated their use on simulated data. In summary, previous studies have only applied GEEs to point-valued data. Therefore, we intend to apply them to panel interval-valued data in this paper.
In this paper, unlike other studies that assume independence, we account for the within-group correlation of panel interval-valued data in order to obtain better parameter estimates, and we use GEEs for this purpose. The results of Monte Carlo simulations show that the proposed method outperforms the LSDV method, which is based on the independence assumption, in terms of estimation accuracy and robustness. When the working correlation matrix is specified correctly, the GEEs method performs better than the LSDV method. When the working correlation matrix is specified incorrectly, the performance of the two methods is similar. In addition, the higher the degree of correlation within the group, the better the performance of the GEEs method. The empirical results also demonstrate the superiority of the proposed method.
The rest of this paper is organized as follows: Current regression models for fixed effects panel interval-valued data and the LSDV method are reviewed in Section 2. The regression method based on generalized estimation equations is presented in Section 3. In Section 4, Monte Carlo simulations and a real data analysis show that, in contrast with the least-squares dummy-variable (LSDV) method, the proposed GEEs method has advantages in forecasting performance. Finally, Section 5 gives the conclusion and discussion.
2. Fixed effects panel interval-valued data models
In this section, the LSDV method and three existing kinds of fixed effects panel interval-valued data models are introduced: the min-max (Min-Max or MM) model, the center and range (CRM) model, and the center (CM) model. Each of these three models has its own advantages and disadvantages.
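For the panel interval-valued dataset $Z=\{(X_{it},y_{it}) \mid i=1,2,\ldots,N;\ t=1,2,\ldots,T\}$, let $X_{it}=(x_{it1},x_{it2},\ldots,x_{itp})$, where $x_{itk}=[x^{l}_{itk},x^{u}_{itk}]$, $y_{it}=[y^{l}_{it},y^{u}_{it}]$, $x^{c}_{itk}=\frac{x^{l}_{itk}+x^{u}_{itk}}{2}$, $x^{r}_{itk}=\frac{x^{u}_{itk}-x^{l}_{itk}}{2}$, $y^{c}_{it}=\frac{y^{l}_{it}+y^{u}_{it}}{2}$, $y^{r}_{it}=\frac{y^{u}_{it}-y^{l}_{it}}{2}$, $k=1,2,\ldots,p$. $x^{l}_{itk}$ ($x^{u}_{itk}$, $x^{c}_{itk}$, or $x^{r}_{itk}$) are considered as explanatory variables and $y^{l}_{it}$ ($y^{u}_{it}$, $y^{c}_{it}$, or $y^{r}_{it}$) are considered as response variables. In addition, let $u^{l}_{i}$ ($u^{u}_{i}$) be the individual effect and $u^{c}_{i}=\frac{u^{u}_{i}+u^{l}_{i}}{2}$, $u^{r}_{i}=\frac{u^{u}_{i}-u^{l}_{i}}{2}$.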
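The center and range transformation defined above can be illustrated with a small sketch. The function names and the toy temperature values below are assumptions made for illustration only; they are not part of the original paper.

```python
import numpy as np

def to_center_range(lower, upper):
    """Map interval bounds to (center, half-range): c = (l + u) / 2, r = (u - l) / 2."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(lower > upper):
        raise ValueError("each lower bound must not exceed its upper bound")
    return (lower + upper) / 2.0, (upper - lower) / 2.0

def to_bounds(center, half_range):
    """Inverse map: l = c - r, u = c + r."""
    center = np.asarray(center, dtype=float)
    half_range = np.asarray(half_range, dtype=float)
    return center - half_range, center + half_range

# Hypothetical daily minimum and maximum temperatures (made-up numbers).
low = np.array([2.0, 4.0, -1.0])
high = np.array([10.0, 9.0, 5.0])
center, half_range = to_center_range(low, high)
```

Note that the "range" used throughout the paper is the half-width of the interval, so the original bounds are recovered as center minus range and center plus range.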
2.1. The min-max model
The min-max model of fixed effects panel interval-valued data regression consists of two different models. Two regression models are established for the upper and lower bounds of the response variable by using the upper and lower bounds of the explanatory variables. The specific form is given as follows:
$$y^{l}_{it}=u^{l}_{i}+\sum_{j=1}^{p}\beta^{l}_{j}x^{l}_{itj}+\epsilon^{l}_{it}, \tag{2.1}$$
$$y^{u}_{it}=u^{u}_{i}+\sum_{j=1}^{p}\beta^{u}_{j}x^{u}_{itj}+\epsilon^{u}_{it}, \tag{2.2}$$
where the errors $\epsilon^{l}_{it}\sim N(0,\sigma^{2})$, $\epsilon^{u}_{it}\sim N(0,\sigma^{2})$, $i=1,2,\ldots,N$; $t=1,2,\ldots,T$.
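Based on the lower bounds of $y_{it}$ and $x_{itj}$ ($i=1,2,\ldots,N$; $t=1,2,\ldots,T$; $j=1,2,\ldots,p$), the parameters $u^{l}_{i}$ and $\beta^{l}=(\beta^{l}_{1},\beta^{l}_{2},\ldots,\beta^{l}_{p})^{T}$ are estimated by the least-squares dummy-variable (LSDV) method:
$$\hat{\beta}^{l}=(\hat{\beta}^{l}_{1},\hat{\beta}^{l}_{2},\ldots,\hat{\beta}^{l}_{p})^{T}=(A^{l})^{-1}B^{l},$$
$$\hat{u}^{l}_{i}=\bar{y}^{l}_{i}-\sum_{j=1}^{p}\hat{\beta}^{l}_{j}\bar{x}^{l}_{i\cdot j},$$
where $A^{l}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{l}_{it}-\bar{X}^{l}_{i})(X^{l}_{it}-\bar{X}^{l}_{i})^{T}$, $B^{l}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{l}_{it}-\bar{X}^{l}_{i})(y^{l}_{it}-\bar{y}^{l}_{i})^{T}$, $X^{l}_{it}=(x^{l}_{it1},x^{l}_{it2},\ldots,x^{l}_{itp})^{T}$, $\bar{X}^{l}_{i}=\frac{1}{T}\sum_{t=1}^{T}X^{l}_{it}$, $\bar{y}^{l}_{i}=\frac{1}{T}\sum_{t=1}^{T}y^{l}_{it}$, $\bar{x}^{l}_{i\cdot j}=\frac{1}{T}\sum_{t=1}^{T}x^{l}_{itj}$.
The estimates of $\beta^{u}$ and $u^{u}_{i}$ are obtained in the same way:
$$\hat{\beta}^{u}=(\hat{\beta}^{u}_{1},\hat{\beta}^{u}_{2},\ldots,\hat{\beta}^{u}_{p})^{T}=(A^{u})^{-1}B^{u},$$
$$\hat{u}^{u}_{i}=\bar{y}^{u}_{i}-\sum_{j=1}^{p}\hat{\beta}^{u}_{j}\bar{x}^{u}_{i\cdot j},$$
where $A^{u}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{u}_{it}-\bar{X}^{u}_{i})(X^{u}_{it}-\bar{X}^{u}_{i})^{T}$, $B^{u}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{u}_{it}-\bar{X}^{u}_{i})(y^{u}_{it}-\bar{y}^{u}_{i})^{T}$, $X^{u}_{it}=(x^{u}_{it1},x^{u}_{it2},\ldots,x^{u}_{itp})^{T}$, $\bar{X}^{u}_{i}=\frac{1}{T}\sum_{t=1}^{T}X^{u}_{it}$, $\bar{y}^{u}_{i}=\frac{1}{T}\sum_{t=1}^{T}y^{u}_{it}$, $\bar{x}^{u}_{i\cdot j}=\frac{1}{T}\sum_{t=1}^{T}x^{u}_{itj}$.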
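For a new observation $X_{it}=(x_{it1},x_{it2},\ldots,x_{itp})^{T}$, with $x_{itj}=[x^{l}_{itj},x^{u}_{itj}]$, the prediction $\hat{y}_{it}=[\hat{y}^{l}_{it},\hat{y}^{u}_{it}]$ is given by:
$$\hat{y}^{l}_{it}=\hat{u}^{l}_{i}+\sum_{j=1}^{p}\hat{\beta}^{l}_{j}x^{l}_{itj},$$
$$\hat{y}^{u}_{it}=\hat{u}^{u}_{i}+\sum_{j=1}^{p}\hat{\beta}^{u}_{j}x^{u}_{itj}.$$
The min-max model does not always guarantee the mathematical coherence of the predicted interval bounds, that is, $\hat{y}^{l}_{it}\leq\hat{y}^{u}_{it}$. When the predicted bounds cross, both are replaced by their midpoint:
$$\hat{y}_{it}=[\hat{y}^{l}_{it},\hat{y}^{u}_{it}]=\begin{cases}\left[\frac{\hat{y}^{l}_{it}+\hat{y}^{u}_{it}}{2},\frac{\hat{y}^{l}_{it}+\hat{y}^{u}_{it}}{2}\right] & \text{if } \hat{y}^{l}_{it}>\hat{y}^{u}_{it};\\[4pt] [\hat{y}^{l}_{it},\hat{y}^{u}_{it}] & \text{if } \hat{y}^{l}_{it}\leq\hat{y}^{u}_{it}.\end{cases}$$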
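As a rough illustration of the LSDV formulas and the coherence rule above, the following sketch fits one bound by within-individual centering and then repairs crossed bounds. The array layout (individuals by periods by covariates), the function names, and the noise-free toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def lsdv_fit(X, y):
    """LSDV (within) estimator for one bound.

    X has shape (N, T, p) and y has shape (N, T); returns (beta_hat, u_hat).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_bar = X.mean(axis=1)                       # individual means, shape (N, p)
    y_bar = y.mean(axis=1)                       # individual means, shape (N,)
    Xd = (X - X_bar[:, None, :]).reshape(-1, X.shape[2])
    yd = (y - y_bar[:, None]).reshape(-1)
    A = Xd.T @ Xd                                # sum of centered outer products
    B = Xd.T @ yd                                # sum of centered cross products
    beta_hat = np.linalg.solve(A, B)
    u_hat = y_bar - X_bar @ beta_hat             # u_i = y_bar_i - sum_j beta_j x_bar_ij
    return beta_hat, u_hat

def make_coherent(y_low, y_up):
    """Replace crossed bounds (lower > upper) by their common midpoint."""
    y_low = np.asarray(y_low, dtype=float)
    y_up = np.asarray(y_up, dtype=float)
    mid = (y_low + y_up) / 2.0
    crossed = y_low > y_up
    return np.where(crossed, mid, y_low), np.where(crossed, mid, y_up)

# Noise-free toy panel (hypothetical values): the estimator should recover the truth.
rng = np.random.default_rng(0)
N, T, p = 5, 6, 2
X_low = rng.uniform(1.0, 3.0, size=(N, T, p))
beta_true = np.array([2.0, 1.0])
u_true = rng.uniform(0.0, 0.5, size=N)
y_low = u_true[:, None] + X_low @ beta_true
beta_hat, u_hat = lsdv_fit(X_low, y_low)
```

The same `lsdv_fit` call applies unchanged to the upper bounds, the centers, or the ranges, since the four estimators share one formula.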
2.2. The center and range model
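The center and range model of fixed effects panel interval-valued data regression is as follows:
$$y^{c}_{it}=u^{c}_{i}+\sum_{j=1}^{p}\beta^{c}_{j}x^{c}_{itj}+\epsilon^{c}_{it}, \tag{2.3}$$
$$y^{r}_{it}=u^{r}_{i}+\sum_{j=1}^{p}\beta^{r}_{j}x^{r}_{itj}+\epsilon^{r}_{it}, \tag{2.4}$$
where the errors $\epsilon^{c}_{it}\sim N(0,\sigma^{2})$, $\epsilon^{r}_{it}\sim N(0,\sigma^{2})$, $i=1,2,\ldots,N$; $t=1,2,\ldots,T$. The estimates of $\beta^{c}$ and $u^{c}_{i}$ given by the LSDV method are:
$$\hat{\beta}^{c}=(\hat{\beta}^{c}_{1},\hat{\beta}^{c}_{2},\ldots,\hat{\beta}^{c}_{p})^{T}=(A^{c})^{-1}B^{c},$$
$$\hat{u}^{c}_{i}=\bar{y}^{c}_{i}-\sum_{j=1}^{p}\hat{\beta}^{c}_{j}\bar{x}^{c}_{i\cdot j},$$
where $A^{c}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{c}_{it}-\bar{X}^{c}_{i})(X^{c}_{it}-\bar{X}^{c}_{i})^{T}$, $B^{c}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{c}_{it}-\bar{X}^{c}_{i})(y^{c}_{it}-\bar{y}^{c}_{i})^{T}$, $X^{c}_{it}=(x^{c}_{it1},x^{c}_{it2},\ldots,x^{c}_{itp})^{T}$, $\bar{X}^{c}_{i}=\frac{1}{T}\sum_{t=1}^{T}X^{c}_{it}$, $\bar{y}^{c}_{i}=\frac{1}{T}\sum_{t=1}^{T}y^{c}_{it}$, $\bar{x}^{c}_{i\cdot j}=\frac{1}{T}\sum_{t=1}^{T}x^{c}_{itj}$.
Similarly, we can obtain the estimates of $\beta^{r}$ and $u^{r}_{i}$:
$$\hat{\beta}^{r}=(\hat{\beta}^{r}_{1},\hat{\beta}^{r}_{2},\ldots,\hat{\beta}^{r}_{p})^{T}=(A^{r})^{-1}B^{r},$$
$$\hat{u}^{r}_{i}=\bar{y}^{r}_{i}-\sum_{j=1}^{p}\hat{\beta}^{r}_{j}\bar{x}^{r}_{i\cdot j},$$
where $A^{r}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{r}_{it}-\bar{X}^{r}_{i})(X^{r}_{it}-\bar{X}^{r}_{i})^{T}$, $B^{r}=\sum_{i=1}^{N}\sum_{t=1}^{T}(X^{r}_{it}-\bar{X}^{r}_{i})(y^{r}_{it}-\bar{y}^{r}_{i})^{T}$, $X^{r}_{it}=(x^{r}_{it1},x^{r}_{it2},\ldots,x^{r}_{itp})^{T}$, $\bar{X}^{r}_{i}=\frac{1}{T}\sum_{t=1}^{T}X^{r}_{it}$, $\bar{y}^{r}_{i}=\frac{1}{T}\sum_{t=1}^{T}y^{r}_{it}$, $\bar{x}^{r}_{i\cdot j}=\frac{1}{T}\sum_{t=1}^{T}x^{r}_{itj}$.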
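For a new observation $X_{it}=(x_{it1},x_{it2},\ldots,x_{itp})^{T}$, with $x_{itj}=[x^{l}_{itj},x^{u}_{itj}]$, the predictions $\hat{y}^{c}_{it}$ and $\hat{y}^{r}_{it}$ are given by:
$$\hat{y}^{c}_{it}=\hat{u}^{c}_{i}+\sum_{j=1}^{p}\hat{\beta}^{c}_{j}x^{c}_{itj},$$
$$\hat{y}^{r}_{it}=\hat{u}^{r}_{i}+\sum_{j=1}^{p}\hat{\beta}^{r}_{j}x^{r}_{itj}.$$
Then, the predicted bounds $\hat{y}^{l}_{it}$ and $\hat{y}^{u}_{it}$ are obtained from the predicted center and range, with a negative predicted range collapsing the interval to its center:
$$\hat{y}_{it}=[\hat{y}^{l}_{it},\hat{y}^{u}_{it}]=\begin{cases}[\hat{y}^{c}_{it}-\hat{y}^{r}_{it},\hat{y}^{c}_{it}+\hat{y}^{r}_{it}] & \text{if } \hat{y}^{r}_{it}\geq 0;\\[4pt] [\hat{y}^{c}_{it},\hat{y}^{c}_{it}] & \text{if } \hat{y}^{r}_{it}<0.\end{cases}$$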
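The reconstruction of the interval from a predicted center and range can be sketched in a few lines. The function name and the input values are hypothetical and serve only to show the two cases of the rule.

```python
import numpy as np

def crm_interval(center_hat, range_hat):
    """Build [lower, upper] from predicted center and half-range.

    A negative predicted range is treated as zero, so the interval
    degenerates to the single point [center, center].
    """
    center_hat = np.asarray(center_hat, dtype=float)
    range_hat = np.asarray(range_hat, dtype=float)
    r = np.where(range_hat >= 0.0, range_hat, 0.0)
    return center_hat - r, center_hat + r

# Hypothetical predictions: the second one has a negative range.
y_low_hat, y_up_hat = crm_interval([10.0, 5.0], [2.0, -1.0])
```

Unlike the min-max model, this construction always yields a lower bound that does not exceed the upper bound.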
2.3. The center model
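Based on the center values of the panel interval-valued data, the linear regression model with fixed individual effects is as follows:
$$y^{c}_{it}=u^{c}_{i}+\sum_{j=1}^{p}\beta^{c}_{j}x^{c}_{itj}+\epsilon^{c}_{it}, \tag{2.5}$$
where the errors $\epsilon^{c}_{it}\sim N(0,\sigma^{2})$, $i=1,2,\ldots,N$; $t=1,2,\ldots,T$.
As described earlier, the estimates of $\beta^{c}$ and $u^{c}_{i}$ are given by the LSDV method:
$$\hat{\beta}^{c}=(\hat{\beta}^{c}_{1},\hat{\beta}^{c}_{2},\ldots,\hat{\beta}^{c}_{p})^{T}=(A^{c})^{-1}B^{c},$$
$$\hat{u}^{c}_{i}=\bar{y}^{c}_{i}-\sum_{j=1}^{p}\hat{\beta}^{c}_{j}\bar{x}^{c}_{i\cdot j}.$$
For a new observation $X_{it}=(x_{it1},x_{it2},\ldots,x_{itp})^{T}$, with $x_{itj}=[x^{l}_{itj},x^{u}_{itj}]$, the prediction $\hat{y}_{it}=[\hat{y}^{l}_{it},\hat{y}^{u}_{it}]$ is given by:
$$\hat{y}^{l}_{it}=\hat{u}^{c}_{i}+\sum_{j=1}^{p}\hat{\beta}^{c}_{j}x^{l}_{itj},$$
$$\hat{y}^{u}_{it}=\hat{u}^{c}_{i}+\sum_{j=1}^{p}\hat{\beta}^{c}_{j}x^{u}_{itj}.$$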
3. Fixed effects panel interval-valued data models based on the GEEs method
This section introduces generalized estimation equations (GEEs) for fixed effects panel interval-valued regression models that account for the correlation of the data within individuals.
For the panel interval-valued dataset $Z=\{(X_{it},y_{it}) \mid i=1,2,\ldots,N;\ t=1,2,\ldots,T\}$, $X_{it}=(x_{it1},x_{it2},\ldots,x_{itp})^{T}$ is a $p\times 1$ vector of interval-valued explanatory variables with $x_{itj}=[x^{l}_{itj},x^{u}_{itj}]$, and $y_{it}=[y^{l}_{it},y^{u}_{it}]$ is the observed interval-valued dependent variable, $i=1,2,\ldots,N$; $t=1,2,\ldots,T$; $j=1,2,\ldots,p$. $u_{i}$ is an unobserved, time-invariant effect for individual $i$. Let $Y^{a}_{i}=(y^{a}_{i1},y^{a}_{i2},\ldots,y^{a}_{iT})^{T}$, $X^{a}_{i}=(X^{a}_{i1},X^{a}_{i2},\ldots,X^{a}_{iT})^{T}$, where $a=l,u,c$, or $r$. Let $Y^{a}=((Y^{a}_{1})^{T},(Y^{a}_{2})^{T},\ldots,(Y^{a}_{N})^{T})^{T}$, $X^{a}=((X^{a}_{1})^{T},(X^{a}_{2})^{T},\ldots,(X^{a}_{N})^{T})^{T}$, $u^{a}=(u^{a}_{1},u^{a}_{2},\ldots,u^{a}_{N})^{T}$.
3.1. The GEEs method of the min-max model
The matrix form of the min-max model of fixed effects panel interval-valued data regression is as follows:
$$Y^{l}=X^{l}\beta^{l}+(I_{N}\otimes\tau_{T})u^{l}+\epsilon^{l}, \tag{3.1}$$
$$Y^{u}=X^{u}\beta^{u}+(I_{N}\otimes\tau_{T})u^{u}+\epsilon^{u}. \tag{3.2}$$
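To save space, we only present the estimation method for model (3.1). Considering the correlation between the observations of the same individual at different times, and the independence between the observations of different individuals, we suppose that $\epsilon^{l}\sim N(0,\Sigma^{l})$ with $\mathrm{cov}(\epsilon^{l}_{is},\epsilon^{l}_{jt})=0$ for $i\neq j$, where $\epsilon^{l}=((\epsilon^{l}_{1})^{T},(\epsilon^{l}_{2})^{T},\ldots,(\epsilon^{l}_{N})^{T})^{T}$ and $\epsilon^{l}_{i}=(\epsilon^{l}_{i1},\epsilon^{l}_{i2},\ldots,\epsilon^{l}_{iT})^{T}$, so that $\Sigma^{l}=\mathrm{diag}(\Sigma^{l}_{1},\Sigma^{l}_{2},\ldots,\Sigma^{l}_{N})$ is block diagonal with $\Sigma^{l}_{i}=\mathrm{cov}(\epsilon^{l}_{i})$.
Estimating $\beta^{l}$ in model (3.1) is complicated by the fixed effects $u^{l}$. Specifically, estimating $u^{l}$ and $\beta^{l}$ simultaneously may incur the "curse of dimensionality", since the dimension of $u^{l}$ grows with $N$. As in Section 2.1, where the data are centered at the individual means, we first use a projection to eliminate the fixed effects in model (3.1). Specifically, let $Q=I_{NT}-P$, $P=I_{N}\otimes(J_{T}/T)$, $J_{T}=\tau_{T}\tau_{T}^{T}$, where $I_{N}$ is the $N\times N$ identity matrix, $I_{NT}$ is the $NT\times NT$ identity matrix, and $\tau_{T}$ is a $T\times 1$ vector of ones. Multiplying model (3.1) by $Q$ on the left, we have
$$\tilde{Y}^{l}=\tilde{X}^{l}\beta^{l}+\tilde{\epsilon}^{l}, \tag{3.3}$$
where $\tilde{Y}^{l}=QY^{l}$, $\tilde{X}^{l}=QX^{l}$, $\tilde{\epsilon}^{l}=Q\epsilon^{l}$, $Q(I_{N}\otimes\tau_{T})=0$, and $\tilde{\epsilon}^{l}\sim N(0,Q\Sigma^{l}Q^{T})$, with $Q=\mathrm{diag}(Q_{1},Q_{2},\ldots,Q_{N})$, $Q_{i}=I_{T}-J_{T}/T$, and
$$Q\Sigma^{l}Q^{T}=\begin{pmatrix}Q_{1}\Sigma^{l}_{1}Q_{1}^{T} & 0 & \cdots & 0\\ 0 & Q_{2}\Sigma^{l}_{2}Q_{2}^{T} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & Q_{N}\Sigma^{l}_{N}Q_{N}^{T}\end{pmatrix}.$$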
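The projection step can be checked numerically. The sketch below builds $P$ and $Q$ for a small panel, taking $\tau_T$ as a vector of ones of length $T$ so that the Kronecker products are conformable, and confirms that $Q$ annihilates the fixed-effect design $I_N\otimes\tau_T$ and coincides with centering each individual's series at its mean. The sizes and the random data are arbitrary choices for illustration.

```python
import numpy as np

N, T = 3, 4                                  # small illustrative panel
tau_T = np.ones((T, 1))                      # vector of ones of length T
J_T = tau_T @ tau_T.T                        # T x T matrix of ones
P = np.kron(np.eye(N), J_T / T)              # averages within each individual
Q = np.eye(N * T) - P                        # within (centering) projection
D = np.kron(np.eye(N), tau_T)                # fixed-effect design, NT x N

rng = np.random.default_rng(1)
Y = rng.normal(size=(N * T, 1))              # stacked as (Y_1^T, ..., Y_N^T)^T
Y_tilde = Q @ Y

# Same result obtained by subtracting each individual's time average.
Y_blocks = Y.reshape(N, T)
Y_demeaned = (Y_blocks - Y_blocks.mean(axis=1, keepdims=True)).reshape(-1, 1)
```

In practice the $NT\times NT$ matrix $Q$ never needs to be formed: subtracting the individual means gives the same transformed data at much lower cost.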
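Then, the generalized estimating equations (GEEs) [23] can be defined as follows:
$$\sum_{i=1}^{N}(\tilde{X}^{l}_{i})^{T}(V^{l}_{i})^{-1}(\tilde{Y}^{l}_{i}-\tilde{X}^{l}_{i}\beta^{l})=0, \tag{3.4}$$
where $\tilde{Y}^{l}_{i}=Q_{i}Y^{l}_{i}=(\tilde{y}^{l}_{i1},\ldots,\tilde{y}^{l}_{iT})^{T}$, $\tilde{X}^{l}_{i}=Q_{i}X^{l}_{i}$, and $V^{l}_{i}$ is the covariance matrix of $\tilde{Y}^{l}_{i}$. Following Liang and Zeger [23], we simplify the covariance of the $i$th subject by taking $V^{l}_{i}=(A^{l}_{i})^{1/2}R_{i}(\alpha)(A^{l}_{i})^{1/2}$, where $A^{l}_{i}=\mathrm{diag}\{\mathrm{var}(\tilde{y}^{l}_{i1}),\ldots,\mathrm{var}(\tilde{y}^{l}_{iT})\}$ and $R_{i}(\alpha)$ is a common working correlation matrix with nuisance parameter $\alpha$. In general, $V^{l}_{i}$ is unknown, and we use the empirical estimator $\hat{V}^{l}_{i}$ in its place. Specifically,
$$\hat{V}^{l}_{i}=(A^{l}_{i})^{1/2}(\hat{\beta}^{l})R_{i}(\hat{\alpha})(A^{l}_{i})^{1/2}(\hat{\beta}^{l}),$$
where $\hat{\alpha}$ can be obtained by the method of Liang and Zeger [23] for the working correlation structures described in Section 3.4.1, and $A^{l}_{i}(\hat{\beta}^{l})$ is the empirical estimator of $A^{l}_{i}$ based on the current estimator of $\beta^{l}$.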
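A minimal sketch of how a working covariance of the form $A^{1/2}RA^{1/2}$ is assembled from marginal variances and a working correlation matrix is given below. The function name, the variances, and the exchangeable structure with $\alpha=0.5$ are assumed values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def working_covariance(variances, R):
    """Return A^{1/2} R A^{1/2}, where A = diag(variances)."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    # Scaling row s and column t by the standard deviations is the same as
    # multiplying by diag(sd) on both sides.
    return sd[:, None] * np.asarray(R, dtype=float) * sd[None, :]

T, alpha = 3, 0.5
R = np.full((T, T), alpha)                   # assumed exchangeable structure
np.fill_diagonal(R, 1.0)
V = working_covariance([4.0, 1.0, 9.0], R)   # hypothetical marginal variances
```

The diagonal of `V` reproduces the marginal variances, and each off-diagonal entry equals the working correlation times the product of the two standard deviations.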
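Thus, we implement an iterative algorithm to solve Eq (3.4). Given the current estimate $\hat{\alpha}$, we use the following iterative procedure for $\beta^{l}$:
$$\hat{\beta}^{l(k+1)}=\hat{\beta}^{l(k)}+\left(\sum_{i=1}^{N}(\tilde{X}^{l}_{i})^{T}(\hat{V}^{l}_{i})^{-1}\tilde{X}^{l}_{i}\right)^{-1}\left(\sum_{i=1}^{N}(\tilde{X}^{l}_{i})^{T}(\hat{V}^{l}_{i})^{-1}(\tilde{Y}^{l}_{i}-\tilde{X}^{l}_{i}\hat{\beta}^{l(k)})\right),$$
where $\hat{V}^{l}_{i}=(A^{l}_{i})^{1/2}(\hat{\beta}^{l(k)})R_{i}(\hat{\alpha})(A^{l}_{i})^{1/2}(\hat{\beta}^{l(k)})$, and the estimates of $\hat{u}^{l}$ are given by
$$\hat{u}^{l}=((I_{N}\otimes\tau_{T})^{T}(I_{N}\otimes\tau_{T}))^{-1}(I_{N}\otimes\tau_{T})^{T}(\bar{Y}^{l}-\bar{X}^{l}\hat{\beta}^{l}),$$
where $\bar{Y}^{l}=PY^{l}$, $\bar{X}^{l}=PX^{l}$. Similarly, for model (3.2), the iterative procedure for $\beta^{u}$ is
$$\hat{\beta}^{u(k+1)}=\hat{\beta}^{u(k)}+\left(\sum_{i=1}^{N}(\tilde{X}^{u}_{i})^{T}(\hat{V}^{u}_{i})^{-1}\tilde{X}^{u}_{i}\right)^{-1}\left(\sum_{i=1}^{N}(\tilde{X}^{u}_{i})^{T}(\hat{V}^{u}_{i})^{-1}(\tilde{Y}^{u}_{i}-\tilde{X}^{u}_{i}\hat{\beta}^{u(k)})\right),$$
where $\hat{V}^{u}_{i}=(A^{u}_{i})^{1/2}(\hat{\beta}^{u(k)})R_{i}(\hat{\alpha})(A^{u}_{i})^{1/2}(\hat{\beta}^{u(k)})$, and the estimates of $\hat{u}^{u}$ are given by
$$\hat{u}^{u}=((I_{N}\otimes\tau_{T})^{T}(I_{N}\otimes\tau_{T}))^{-1}(I_{N}\otimes\tau_{T})^{T}(\bar{Y}^{u}-\bar{X}^{u}\hat{\beta}^{u}), \tag{3.8}$$
where $\bar{Y}^{u}=PY^{u}$, $\bar{X}^{u}=PX^{u}$.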
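The matrix expression in Eq (3.8) reduces to a simple average. The short derivation below is a worked simplification using only the definitions $P=I_{N}\otimes(J_{T}/T)$ and $J_{T}=\tau_{T}\tau_{T}^{T}$, with $\tau_{T}$ a vector of ones of length $T$; it shows that the GEEs estimate of each fixed effect has the same form as the LSDV formula in Section 2.1, with the GEEs coefficient estimate plugged in.

$$
\begin{aligned}
(I_{N}\otimes\tau_{T})^{T}(I_{N}\otimes\tau_{T}) &= I_{N}\otimes(\tau_{T}^{T}\tau_{T}) = T\,I_{N},\\
(I_{N}\otimes\tau_{T})^{T}P &= I_{N}\otimes\left(\tau_{T}^{T}\tau_{T}\tau_{T}^{T}/T\right) = I_{N}\otimes\tau_{T}^{T},\\
\hat{u}^{u} &= \frac{1}{T}(I_{N}\otimes\tau_{T}^{T})(Y^{u}-X^{u}\hat{\beta}^{u}),\\
\hat{u}^{u}_{i} &= \bar{y}^{u}_{i}-\sum_{j=1}^{p}\hat{\beta}^{u}_{j}\bar{x}^{u}_{i\cdot j},\qquad i=1,2,\ldots,N.
\end{aligned}
$$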
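For a new observation $X=[X^{l},X^{u}]$, the prediction $\hat{Y}=[\hat{Y}^{l},\hat{Y}^{u}]$ is given by
$$\hat{Y}^{l}=X^{l}\hat{\beta}^{l}+(I_{N}\otimes\tau_{T})\hat{u}^{l},$$
$$\hat{Y}^{u}=X^{u}\hat{\beta}^{u}+(I_{N}\otimes\tau_{T})\hat{u}^{u}.$$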
3.2. The GEEs method of the center and range model
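The matrix form of the center and range model of fixed effects panel interval-valued data regression is as follows:
$$Y^{c}=X^{c}\beta^{c}+(I_{N}\otimes\tau_{T})u^{c}+\epsilon^{c}, \tag{3.9}$$
$$Y^{r}=X^{r}\beta^{r}+(I_{N}\otimes\tau_{T})u^{r}+\epsilon^{r}, \tag{3.10}$$
where $\epsilon^{c}\sim N(0,\Sigma^{c})$, $\epsilon^{r}\sim N(0,\Sigma^{r})$, $\Sigma^{c}=\frac{\Sigma^{l}+\Sigma^{u}}{4}=\Sigma^{r}$.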
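Similar to Section 3.1, we only present the estimation method of model (3.9). To eliminate the fixed effects, let $\tilde{Y}^{c}=QY^{c}$, $\tilde{X}^{c}=QX^{c}$, $\tilde{\epsilon}^{c}=Q\epsilon^{c}$, and note that $Q(I_{N}\otimes\tau_{T})=0$. Multiplying (3.9) by $Q$ on the left gives the following within-group regression model:
$$\tilde{Y}^{c}=\tilde{X}^{c}\beta^{c}+\tilde{\epsilon}^{c}, \tag{3.11}$$
where $\tilde{\epsilon}^{c}\sim N(0,Q\Sigma^{c}Q^{T})$. Then, we can obtain the estimate $\hat{\beta}^{c}$ by iterating the following relationship to convergence:
$$\hat{\beta}^{c(k+1)}=\hat{\beta}^{c(k)}+\left(\sum_{i=1}^{N}(\tilde{X}^{c}_{i})^{T}(\hat{V}^{c}_{i})^{-1}\tilde{X}^{c}_{i}\right)^{-1}\left(\sum_{i=1}^{N}(\tilde{X}^{c}_{i})^{T}(\hat{V}^{c}_{i})^{-1}(\tilde{Y}^{c}_{i}-\tilde{X}^{c}_{i}\hat{\beta}^{c(k)})\right),$$
where $\hat{V}^{c}_{i}=(A^{c}_{i})^{1/2}(\hat{\beta}^{c(k)})R_{i}(\hat{\alpha})(A^{c}_{i})^{1/2}(\hat{\beta}^{c(k)})$, and the estimates of $\hat{u}^{c}$ are obtained as follows:
$$\hat{u}^{c}=((I_{N}\otimes\tau_{T})^{T}(I_{N}\otimes\tau_{T}))^{-1}(I_{N}\otimes\tau_{T})^{T}(\bar{Y}^{c}-\bar{X}^{c}\hat{\beta}^{c}),$$
where $\bar{Y}^{c}=PY^{c}$, $\bar{X}^{c}=PX^{c}$.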
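For a new observation $X=[X^{l},X^{u}]$, the prediction $\hat{Y}=[\hat{Y}^{l},\hat{Y}^{u}]$ is given by
$$\hat{Y}^{l}=X^{l}\hat{\beta}^{c}+(I_{N}\otimes\tau_{T})\hat{u}^{c},$$
$$\hat{Y}^{u}=X^{u}\hat{\beta}^{c}+(I_{N}\otimes\tau_{T})\hat{u}^{c}.$$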
3.4. Computation algorithm
3.4.1. The working correlation matrix
The following introduces the structures of several commonly used working correlation matrices [23,31,32]:
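● Independence: The independence working correlation matrix assumes that all the response variables are mutually independent, that is, $\alpha=0$ and $R_{i}(\alpha)=I_{T}$.
● Exchangeable: The correlation between any two observations of the same individual is assumed to be the same. The form of the correlation matrix is as follows:
$$R_{i}(\alpha)=\begin{pmatrix}1 & \alpha & \cdots & \alpha\\ \alpha & 1 & \cdots & \alpha\\ \vdots & \vdots & \ddots & \vdots\\ \alpha & \alpha & \cdots & 1\end{pmatrix}.$$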
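The working correlation structures can be written down directly. The sketch below builds the independence and exchangeable matrices, together with the first-order autoregressive (AR(1)) structure used in the simulations of Section 4.1, which is assumed here to have the standard form with entry $\alpha^{|s-t|}$ in position $(s,t)$. The function names are illustrative.

```python
import numpy as np

def independence_corr(T):
    """Independence: R = I_T (alpha = 0)."""
    return np.eye(T)

def exchangeable_corr(T, alpha):
    """Exchangeable: ones on the diagonal, alpha everywhere else."""
    return (1.0 - alpha) * np.eye(T) + alpha * np.ones((T, T))

def ar1_corr(T, alpha):
    """AR(1), assumed standard form: entry (s, t) equals alpha ** |s - t|."""
    idx = np.arange(T)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

R_exch = exchangeable_corr(3, 0.3)
R_ar1 = ar1_corr(3, 0.5)
```

Both parametric structures reduce to the independence matrix when $\alpha=0$, and under AR(1) the correlation decays with the time lag while under the exchangeable structure it stays constant.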
In this subsection, we present a feasible algorithm as follows. Consider the general form of the model: $Y=X\beta+(I_{N}\otimes\tau_{T})u+\epsilon$.
Step 1: Use the matrix $Q$ to eliminate the fixed effects $u$. Then the initial value $\hat{\beta}^{(0)}$ is obtained under the independence assumption.
Step 2: The estimates $\hat{\alpha}^{(k)}$ and $A_{i}^{1/2}(\hat{\beta}^{(k)})$ are obtained based on the assumed structure of the correlation matrix as in Section 3.4.1. Then $\hat{V}^{(k)}_{i}=A_{i}^{1/2}(\hat{\beta}^{(k)})R_{i}(\hat{\alpha}^{(k)})A_{i}^{1/2}(\hat{\beta}^{(k)})$ is obtained. Update
$$\hat{\beta}^{(k+1)}=\hat{\beta}^{(k)}+\left(\sum_{i=1}^{N}(\tilde{X}_{i})^{T}(\hat{V}^{(k)}_{i})^{-1}\tilde{X}_{i}\right)^{-1}\left(\sum_{i=1}^{N}(\tilde{X}_{i})^{T}(\hat{V}^{(k)}_{i})^{-1}(\tilde{Y}_{i}-\tilde{X}_{i}\hat{\beta}^{(k)})\right).$$
Step 3: Iterate Step 2 until convergence, and denote the final estimators as the GEEs estimators. Then $\hat{u}=((I_{N}\otimes\tau_{T})^{T}(I_{N}\otimes\tau_{T}))^{-1}(I_{N}\otimes\tau_{T})^{T}(\bar{Y}-\bar{X}\hat{\beta})$.
Alternatively, estimates of $\beta$ can be obtained with existing software, for example the geepack package in R.
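Steps 1 to 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the working correlation is AR(1) with entries $\alpha^{|s-t|}$, the marginal variance is taken to be a single common value so that each $A_i$ is a multiple of the identity, and $\alpha$ is estimated by a simple lag-one moment estimator in the spirit of Liang and Zeger [23]. All function and variable names are hypothetical.

```python
import numpy as np

def ar1_corr(T, alpha):
    idx = np.arange(T)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

def gee_within(X, y, max_iter=100, tol=1e-10):
    """GEE fit for a fixed effects panel; X is (N, T, p), y is (N, T)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, T, p = X.shape
    # Step 1: remove fixed effects by centering (Q_i X_i, Q_i Y_i), then
    # start from the independence (least-squares) fit.
    Xt = X - X.mean(axis=1, keepdims=True)
    yt = y - y.mean(axis=1, keepdims=True)
    Xf, yf = Xt.reshape(-1, p), yt.reshape(-1)
    beta = np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)
    alpha = 0.0
    for _ in range(max_iter):
        # Step 2a: moment estimates of the common variance and of alpha.
        resid = yt - Xt @ beta
        phi = np.mean(resid ** 2)
        alpha = np.sum(resid[:, :-1] * resid[:, 1:]) / (N * (T - 1) * phi)
        alpha = float(np.clip(alpha, -0.99, 0.99))
        V_inv = np.linalg.inv(phi * ar1_corr(T, alpha))
        # Step 2b: beta <- beta + (sum X'V^-1 X)^-1 (sum X'V^-1 resid).
        H = np.einsum("itp,ts,isq->pq", Xt, V_inv, Xt)
        g = np.einsum("itp,ts,is->p", Xt, V_inv, resid)
        step = np.linalg.solve(H, g)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Step 3: fixed effects from individual means.
    u = y.mean(axis=1) - X.mean(axis=1) @ beta
    return beta, u, alpha

# Synthetic check with AR(1) errors (all values are illustrative).
rng = np.random.default_rng(42)
N, T = 50, 10
beta_true = np.array([1.0, 3.0, 2.0])
X = rng.uniform(5.0, 10.0, size=(N, T, 3))
u_true = rng.uniform(0.0, 0.5, size=N)
chol = np.linalg.cholesky(4.0 * ar1_corr(T, 0.9))
eps = rng.standard_normal((N, T)) @ chol.T
y = u_true[:, None] + X @ beta_true + eps
beta_hat, u_hat, alpha_hat = gee_within(X, y)
```

One design point is worth noting: with an exchangeable working correlation the centered residuals of each individual sum to zero, so a moment estimate of $\alpha$ computed from them sits at the boundary value $-1/(T-1)$. The AR(1) structure is used in this sketch to avoid that degenerate case.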
4. Numerical illustrations
This section compares the proposed GEEs method with the LSDV method proposed by Ji et al. [20] through computational experiments. Two types of panel interval-valued datasets are considered. The first is a synthetic panel interval-valued dataset, while the second is the air pollution dataset of the 26 cities in China's Yangtze River Delta.
4.1. Monte Carlo simulations
Without loss of generality, we provide two configuration schemes for synthetic panel interval-valued datasets: one generated from the min-max model and another generated from the CRM model. In the simulation framework, 20 or 50 individuals are considered, and each individual has three explanatory variables observed over 14 consecutive periods, that is, $N=20$ or $50$ and $T=14$. The first 10 periods of data form the training set, and the remaining 4 periods form the test set. Considering practical scenarios, the correlation matrix is either the first-order autoregressive (AR(1)) correlation setting $R_{1}(\alpha)$ or the exchangeable correlation setting $R_{2}(\alpha)$.
4.1.1. Dataset configuration based on the CRM model
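In this section, we consider the following panel interval-valued data-generating processes:
$$y^{c}_{it}=u^{c}_{i}+\sum_{j=1}^{3}\beta^{c}_{j}x^{c}_{itj}+\epsilon^{c}_{it},$$
$$y^{r}_{it}=u^{r}_{i}+\sum_{j=1}^{3}\beta^{r}_{j}x^{r}_{itj}+\epsilon^{r}_{it},$$
where $\beta^{c}=(\beta^{c}_{1},\beta^{c}_{2},\beta^{c}_{3})=(1,3,2)$, $\beta^{r}=(\beta^{r}_{1},\beta^{r}_{2},\beta^{r}_{3})=(1,2,1)$; $x^{c}_{itj}\sim U[5,10]$, $x^{r}_{itj}\sim U[1,3]$; the fixed effects $u^{c}_{i}$ and $u^{r}_{i}\sim U[0,0.5]$; $\epsilon^{c}_{i}=(\epsilon^{c}_{i1},\epsilon^{c}_{i2},\ldots,\epsilon^{c}_{iT})\sim N(0,\sigma^{2}_{c}R_{m}(\alpha))$, $\epsilon^{r}_{i}=(\epsilon^{r}_{i1},\epsilon^{r}_{i2},\ldots,\epsilon^{r}_{iT})\sim N(0,\sigma^{2}_{r}R_{m}(\alpha))$, where $\sigma^{2}_{c}=\sigma^{2}_{r}=4$ and $R_{m}(\alpha)$, $m=1,2$, is one of the two correlation settings above. The correlation coefficient is set to $\alpha=0.9,0.5,0.3$ to generate data with different degrees of correlation.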
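The data-generating process above can be sketched as follows. The function names, the seed, and the choice to draw the center and range errors independently of each other are assumptions made for illustration; the split into 10 training and 4 test periods follows the description in Section 4.1.

```python
import numpy as np

def ar1_corr(T, alpha):
    idx = np.arange(T)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

def exchangeable_corr(T, alpha):
    return (1.0 - alpha) * np.eye(T) + alpha * np.ones((T, T))

def simulate_crm_panel(N, T, alpha, corr, rng, sigma2=4.0):
    """Draw one synthetic panel from the center and range configuration."""
    beta_c = np.array([1.0, 3.0, 2.0])
    beta_r = np.array([1.0, 2.0, 1.0])
    x_c = rng.uniform(5.0, 10.0, size=(N, T, 3))
    x_r = rng.uniform(1.0, 3.0, size=(N, T, 3))
    u_c = rng.uniform(0.0, 0.5, size=N)
    u_r = rng.uniform(0.0, 0.5, size=N)
    # Rows of z @ chol.T have covariance sigma2 * R(alpha).
    chol = np.linalg.cholesky(sigma2 * corr(T, alpha))
    eps_c = rng.standard_normal((N, T)) @ chol.T
    eps_r = rng.standard_normal((N, T)) @ chol.T   # assumed independent of eps_c
    y_c = u_c[:, None] + x_c @ beta_c + eps_c
    y_r = u_r[:, None] + x_r @ beta_r + eps_r
    return {"x_c": x_c, "x_r": x_r, "y_c": y_c, "y_r": y_r}

panel = simulate_crm_panel(N=20, T=14, alpha=0.9, corr=ar1_corr,
                           rng=np.random.default_rng(2025))
train = {name: arr[:, :10] for name, arr in panel.items()}   # first 10 periods
test = {name: arr[:, 10:] for name, arr in panel.items()}    # last 4 periods
```

Passing `exchangeable_corr` instead of `ar1_corr` gives the second correlation setting, and the min-max configuration differs only in the coefficient vectors and in which bounds play the role of the covariates.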
4.1.2. Dataset configuration based on the min-max model
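In this section, we consider the following panel interval-valued data-generating processes:
$$y^{l}_{it}=u^{l}_{i}+\sum_{j=1}^{3}\beta^{l}_{j}x^{l}_{itj}+\epsilon^{l}_{it},$$
$$y^{u}_{it}=u^{u}_{i}+\sum_{j=1}^{3}\beta^{u}_{j}x^{u}_{itj}+\epsilon^{u}_{it},$$
where $\beta^{u}=(\beta^{u}_{1},\beta^{u}_{2},\beta^{u}_{3})=(3,2,2)$, $\beta^{l}=(\beta^{l}_{1},\beta^{l}_{2},\beta^{l}_{3})=(2,1,1)$; $x^{u}_{itj}\sim U[5,10]$, $x^{l}_{itj}\sim U[1,3]$; the fixed effects $u^{u}_{i}$ and $u^{l}_{i}\sim U[0,0.5]$; $\epsilon^{u}_{i}=(\epsilon^{u}_{i1},\epsilon^{u}_{i2},\ldots,\epsilon^{u}_{iT})\sim N(0,\sigma^{2}_{u}R_{m}(\alpha))$, $\epsilon^{l}_{i}=(\epsilon^{l}_{i1},\epsilon^{l}_{i2},\ldots,\epsilon^{l}_{iT})\sim N(0,\sigma^{2}_{l}R_{m}(\alpha))$, where $\sigma^{2}_{u}=\sigma^{2}_{l}=4$ and $R_{m}(\alpha)$, $m=1,2$, is one of the two correlation settings above. The correlation coefficient is set to $\alpha=0.9,0.5,0.3$ to generate data with different degrees of correlation.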
4.1.3. Implementation and evaluation
Four criteria are used to assess the performance of the different models:
(1) The mean average absolute error (MAE), the mean magnitude of relative error (MMER), and the root mean square error (RMSE), as defined by Ji et al. [20].
(2) The rate of different intervals (RI) defined by Hu and He [33]:
$$RI=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{w(y_{it}\cap\hat{y}_{it})}{w(y_{it}\cup\hat{y}_{it})},$$
where $w(\cdot)$ is the interval width.
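The RI criterion can be computed as in the sketch below. Two conventions are assumptions here, since the text does not spell them out: the width of the union is taken as the total length covered by the two intervals (the two widths minus the overlap), and a pair of degenerate intervals with zero union width contributes 0. The function name and the example intervals are hypothetical.

```python
import numpy as np

def rate_of_intervals(y_low, y_up, yhat_low, yhat_up):
    """Average ratio of overlap width to union width over all observations."""
    y_low, y_up = np.asarray(y_low, float), np.asarray(y_up, float)
    yhat_low, yhat_up = np.asarray(yhat_low, float), np.asarray(yhat_up, float)
    # Overlap width is zero when the intervals do not intersect.
    inter = np.maximum(0.0, np.minimum(y_up, yhat_up) - np.maximum(y_low, yhat_low))
    # Assumed convention: union width = total length covered by both intervals.
    union = (y_up - y_low) + (yhat_up - yhat_low) - inter
    ratio = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    return float(ratio.mean())

# Partial overlap, perfect match, and disjoint intervals.
ri = rate_of_intervals([0.0, 0.0, 0.0], [4.0, 4.0, 4.0],
                       [2.0, 0.0, 5.0], [6.0, 4.0, 7.0])
```

Here the three ratios are 1/3, 1, and 0, so `ri` equals 4/9. Larger values indicate predicted intervals that agree more closely with the observed ones, in contrast with the three error measures, for which smaller is better.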
Tables 1–4 show the fitting results when the working correlation matrix is specified correctly, while Tables 5–8 show the fitting results when it is specified incorrectly. Figures 1–4 display the box plots of the four evaluation indices calculated from the GEEs method in different models, with the dataset generated from the CRM model or the min-max model, when the working correlation matrices are specified correctly. The box plots are similar when the working correlation matrices are specified incorrectly.
Table 1. Comparison of the GEEs and LSDV methods based on the dataset generated by the CRM model and AR(1) correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 2. Comparison of the GEEs and LSDV methods based on the dataset generated by the min-max model and AR(1) correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 3. Comparison of the GEEs and LSDV methods based on the dataset generated by the CRM model and AR(1) correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 4. Comparison of the GEEs and LSDV methods based on the dataset generated by the min-max model and AR(1) correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 5. Comparison of the GEEs and LSDV methods based on the dataset generated by the CRM model and exchangeable correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 6. Comparison of the GEEs and LSDV methods based on the dataset generated by the min-max model and exchangeable correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 7. Comparison of the GEEs and LSDV methods based on the dataset generated by the CRM model and exchangeable correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 8. Comparison of the GEEs and LSDV methods based on the dataset generated by the min-max model and exchangeable correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Figure 1. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the CRM model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.9.
Figure 2. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the CRM model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.3.
Figure 3. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the min-max model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.9.
Figure 4. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the min-max model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.3.
● Tables 1 and 2 list the fitting results for β, and we have the following findings: (i) Regardless of which model generated the data, the GEEs method performs better than the LSDV method. (ii) As α or N increases, the mean square error of the GEEs estimates decreases. That is, the higher the degree of correlation within the group, the better the GEEs method performs, and the decrease in mean square error with N is in line with the consistency of the estimator.
● Tables 3 and 4 show that (i) the GEEs method performs better than the LSDV method according to the four indicators, regardless of the degree of within-group correlation and the sample size. (ii) The CRM model performs best when the data are generated from centers and ranges. Similarly, the min-max model performs best when the data are generated from minimum and maximum values.
● From Tables 5–8, we can see that the fitting result of the GEEs method is close to that of the LSDV method when the working correlation matrix is incorrectly specified.
● Figures 1 and 2 display the box plots of the four evaluation indices calculated from the GEEs method in different models with the dataset generated from the CRM model, when the working correlation matrices are specified correctly with N=20 and α=0.9, 0.3. We can see that the CRM model performs better than the other models.
● Figures 3 and 4 display the box plots of the four evaluation indices calculated from the GEEs method in different models with the dataset generated from the min-max model, when the working correlation matrices are specified correctly with N=20 and α=0.9, 0.3. We can see that the min-max model performs better than the other models.
4.2. Real data analysis
PM2.5 is a main pollutant affecting human health and atmospheric environment quality. With the continuous development of the social economy and the acceleration of industrialization, the problem of air pollution is becoming more serious. Studies have shown that five air pollutants, O3, SO2, NO2, CO, and PM10, have a significant impact on PM2.5 concentrations. Exploring the relationship between these air pollutants and PM2.5 can help with the effective prevention and control of air pollution. Therefore, this section selects the 26 cities in China's Yangtze River Delta as the research object and analyzes their interval-valued air pollution data. The daily data of six air quality indices of the Yangtze River Delta urban agglomeration during the 59 days from January 1 to February 28, 2023, can be obtained from the China Air Quality Online Monitoring platform (https://www.aqistudy.cn/historydata/). On inspection, there are no outliers or missing values in the data. We then take the minimum and maximum values of the daily data of each index to form the interval-valued dataset.
The observed PM2.5 data are taken as the response variable, and the other five indices are taken as independent variables. The data from the first 54 days are used as the training set, and the remaining 5 days are used as the test set. The data are then fitted with the models described in Sections 2 and 3. Finally, we compare the fitting performance of the GEEs method and the LSDV method when the working correlation matrix is AR(1). The predictive performance of the two methods is listed in Table 9, and the regression results of the GEEs method are shown in Table 10.
Table 9. Comparison of GEEs and LSDV methods in each model for dealing with the panel interval-valued air pollutant dataset.
Table 10. The estimated regression coefficients β and their robust standard deviation estimated by the GEEs method in different models, with their significance tests shown in parentheses ('***'-0.001, '**'-0.01, '*'-0.05, '.'-0.1).
● Table 9 shows that both methods fit the real data well in the different models. Moreover, the GEEs method has better fitting performance than the LSDV method in every model. Taking the four performance indices into account, the min-max model performs better than the other models. This may be due to the type of experimental data, and it is consistent with the conclusions of the Monte Carlo experiments.
● Table 10 shows that PM10, NO2, and CO are positively associated with PM2.5, while SO2 and O3 are negatively associated with PM2.5. In particular, CO has the greatest impact on PM2.5.
5. Conclusions and discussion
In this paper, we proposed applying GEEs to fixed effects panel interval-valued data models so that the correlation within a group is taken into account. Monte Carlo simulation experiments tested the feasibility of this method, and the proposed method was applied to PM2.5 forecasting. The experiments show that the proposed method performs better than the existing LSDV method.
However, the proposed estimation method has some limitations. For example, the model is linear; it could be extended to a nonlinear regression model to capture complex nonlinear relationships between variables. In addition, we did not take into account the correlation between the bounds of the panel interval-valued data. In future studies, we will introduce variance modeling or joint modeling to obtain better estimators.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
The authors express their thanks to the Editor and all referees for their constructive suggestions, which led to improvement of an earlier version of this paper.
Conflict of interest
The authors declare there are no conflicts of interest.
References
[1] E. Diday, The symbolic approach in clustering and related methods of data analysis, classification and related methods of data analysis, in Proceedings of the first Conference of the Federation of the classification societies. North Holland, 1988.
[2] L. Billard, E. Diday, Regression analysis for interval-valued data, in Data Analysis, Classification, and Related Methods, Springer, (2000), 369–374.
[3] L. Billard, E. Diday, Symbolic regression analysis, in Classification, Clustering, and Data Analysis: Recent Advances and Applications, Springer, (2002), 281–288.
[4] E. d. A. L. Neto, F. D. A. De Carvalho, Centre and range method for fitting a linear regression model to symbolic interval data, Comput. Stat. Data Anal., 52 (2008), 1500–1515. https://doi.org/10.1016/j.csda.2007.04.014
[5] L. Kong, X. Gao, A regularized MM estimate for interval-valued regression, Expert Syst. Appl., 238 (2024), 122044. https://doi.org/10.1016/j.eswa.2023.122044
[6] M. Xu, Z. Qin, A bivariate Bayesian method for interval-valued regression models, Knowl.-Based Syst., 235 (2022), 107396. https://doi.org/10.1016/j.knosys.2021.107396
[7] U. Beyaztas, H. L. Shang, A. S. G. Abdel-Salam, Functional linear models for interval-valued data, Commun. Stat.-Simul. Comput., 51 (2022), 3513–3532. https://doi.org/10.1080/03610918.2020.1714662
[8] Q. Zhao, H. Wang, S. Wang, Robust regression for interval-valued data based on midpoints and log-ranges, Adv. Data Anal. Classif., 17 (2023), 583–621. https://doi.org/10.1007/s11634-022-00518-2
[9] L. C. Lin, H. L. Chien, S. Lee, Symbolic interval-valued data analysis for time series based on auto-interval-regressive models, Stat. Method. Appl., 30 (2021), 295–315. https://doi.org/10.1007/s10260-020-00525-7
[10] J. Zhang, M. Liu, M. Dong, Variational Bayesian inference for interval regression with an asymmetric Laplace distribution, Neurocomputing, 323 (2019), 214–230. https://doi.org/10.1016/j.neucom.2018.09.083
[11] C. Yang, Interval Riccati equation-based and non-probabilistic dynamic reliability-constrained multi-objective optimal vibration control with multi-source uncertainties, J. Sound Vibr., 595 (2025), 118742. https://doi.org/10.1016/j.jsv.2024.118742
[12] C. Yang, Y. Liu, H. Gao, Reliability-constrained uncertain spacecraft sliding mode attitude tracking control with interval parameters, IEEE Trans. Aerosp. Electron. Syst., 61 (2025), 1–14. https://doi.org/10.1109/TAES.2025.3529798
[13] A. F. Galvao Jr, Quantile regression for dynamic panel data with fixed effects, J. Econom., 164 (2011), 142–157. https://doi.org/10.1016/j.jeconom.2011.02.016
[14] E. Aristodemou, Semiparametric identification in panel data discrete response models, J. Econom., 220 (2021), 253–271. https://doi.org/10.1016/j.jeconom.2020.04.002
[15] B. H. Beyaztas, S. Bandyopadhyay, Robust estimation for linear panel data models, Stat. Med., 39 (2020), 4421–4438. https://doi.org/10.1002/sim.8732
[16] J. Bai, S. H. Choi, Y. Liao, Feasible generalized least squares for panel data with cross-sectional and serial correlations, Empir. Econ., 60 (2021), 309–326. https://doi.org/10.1007/s00181-020-01977-2
[17] B. H. Baltagi, D. Li, Prediction in the panel data model with spatial correlation, in Advances in Spatial Econometrics: Methodology, Tools and Applications, Springer, (2004), 283–295.
[18] J. P. Elhorst, Spatial panel data models, in Spatial Econometrics: From Cross-Sectional Data to Spatial Panels, Springer, (2014), 37–93.
[19] Y. Zhang, J. Jiang, Y. Feng, Penalized quantile regression for spatial panel data with fixed effects, Commun. Stat.-Theory Methods, 52 (2023), 1287–1299. https://doi.org/10.1080/03610926.2021.1934028
[20] A. Ji, J. Zhang, X. He, Y. Zhang, Fixed effects panel interval-valued data models and applications, Knowl.-Based Syst., 237 (2022), 107798. https://doi.org/10.1016/j.knosys.2021.107798
[21] J. Zhang, Q. Li, B. Wei, A. Ji, Robust estimation method for panel interval-valued data model with fixed effects, J. Stat. Comput. Simul., 94 (2024), 3230–3251. https://doi.org/10.1080/00949655.2024.2381491
[22] A. Ji, J. Zhang, Y. Cao, Bivariate maximum likelihood method for fixed effects panel interval-valued data models, Comput. Econ., 64 (2024), 1–28. https://doi.org/10.1007/s10614-024-10737-8
[23] K. Y. Liang, S. L. Zeger, Longitudinal data analysis using generalized linear models, Biometrika, 73 (1986), 13–22. https://doi.org/10.1093/biomet/73.1.13
[24] M. Crowder, On consistency and inconsistency of estimating equations, Econom. Theory, 2 (1986), 305–330. https://doi.org/10.1017/S0266466600011646
[25] R. M. Balan, I. S. Kratina, Asymptotic results with generalized estimating equations for longitudinal data, Ann. Stat., 33 (2005), 522–541. https://doi.org/10.1214/009053604000001255
[26] L. Wang, GEE analysis of clustered binary data with diverging number of covariates, Ann. Stat., 39 (2011), 389–417. https://doi.org/10.1214/10-AOS846
[27] Y. G. Wang, V. Carey, Working correlation structure misspecification, estimation and covariate design: implications for generalised estimating equations performance, Biometrika, 90 (2003), 29–41. https://doi.org/10.1093/biomet/90.1.29
[28] W. Pan, Model selection in estimating equations, Biometrics, 57 (2001), 529–534. https://doi.org/10.1111/j.0006-341X.2001.00529.x
[29] M. Wang, Generalized estimating equations in longitudinal data analysis: a review and recent developments, Adv. Stat., 2014 (2014), 303728. https://doi.org/10.1155/2014/303728
[30] S. Seaman, A. Copas, Doubly robust generalized estimating equations for longitudinal data, Stat. Med., 28 (2009), 937–955. https://doi.org/10.1002/sim.3520
[31] N. R. Chaganty, An alternative approach to the analysis of longitudinal data via generalized estimating equations, J. Stat. Plan. Infer., 63 (1997), 39–54. https://doi.org/10.1016/S0378-3758(96)00203-0
[32] N. R. Chaganty, J. Shults, On eliminating the asymptotic bias in the quasi-least squares estimate of the correlation parameter, J. Stat. Plan. Infer., 76 (1999), 145–161. https://doi.org/10.1016/S0378-3758(98)00180-3
[33] C. Hu, L. T. He, An application of interval methods to stock market forecasting, Reliab. Comput., 13 (2007), 423–434. https://doi.org/10.1007/s11155-007-9039-4
Table 1. Comparison for the GEEs and LSDV methods based on the dataset generated by the CRM model and AR(1) correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 2. Comparison for the GEEs and LSDV methods based on the dataset generated by the min-max model and AR(1) correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 3. Comparison for the GEEs and LSDV methods based on the dataset generated by the CRM model and AR(1) correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 4. Comparison for the GEEs and LSDV methods based on the dataset generated by the min-max model and AR(1) correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 5. Comparison for the GEEs and LSDV methods based on the dataset generated by the CRM model and exchangeable correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 6. Comparison for the GEEs and LSDV methods based on the dataset generated by the min-max model and exchangeable correlation matrix, with the average and mean square error, where the working correlation matrix is AR(1).
Table 7. Comparison for the GEEs and LSDV methods based on the dataset generated by the CRM model and exchangeable correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 8. Comparison for the GEEs and LSDV methods based on the dataset generated by the min-max model and exchangeable correlation matrix, with the average and standard deviation, where the working correlation matrix is AR(1).
Table 10. The estimated regression coefficients β and their robust standard deviations estimated by the GEEs method in different models, with their significance tests shown in parentheses ('***'-0.001, '**'-0.01, '*'-0.05, '.'-0.1).
Figure 1. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the CRM model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.9
Figure 2. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the CRM model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.3
Figure 3. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the min-max model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.9
Figure 4. The box plots of four evaluation indices from the GEEs and LSDV methods with the dataset generated by the min-max model and AR(1) correlation matrix, and the working correlation matrix specified correctly with N=20, α=0.3