1. Introduction
The use of auxiliary information at the estimation stage is an effective way to improve the precision with which the population mean of a study variable is estimated. When the study and auxiliary variables are positively correlated, the ratio method of estimation is typically employed; when they are negatively correlated, the product method is used. Survey statisticians have developed exponential-type estimators for various sampling scenarios, as in the works of Singh and Vishwakarma [1], Grover et al. [2], and Noor-ul-Amin and Hanif [3]. More recent efficient estimators of a hybrid type were proposed by Sher et al. [4], Muneer et al. [5], and Choudhury and Singh [6].
Sabat et al. [7], Di Gravio et al. [8], and Zaman and Kadilar [9] demonstrate the common use of two-phase sampling in cases where collecting data on the variables of interest is cost-prohibitive, whereas data on variables correlated with them are more economical to collect. For example, on-site assessment in forest surveys of remote locations can be both challenging and expensive, whereas aerial photography is a more cost-effective alternative that yields comparable information on the type of forest without costly ground visits.
In many important situations, the population mean $\bar{X}$ of the auxiliary variable is not known before the survey. In such a situation, the sample mean $\bar{x}_1=\frac{1}{n_1}\sum_{i=1}^{n_1}x_i$, obtained from an initial sample of $n_1$ units ($n_1<N$) drawn by simple random sampling without replacement (SRSWOR), is used as auxiliary information. At the second phase, a sample of $n$ units ($n<n_1$) is drawn in the same fashion, and the sample means of both the study variable, $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$, and the auxiliary variable, $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$, are obtained; see Tato and Singh [10] and Misra [11].
Suppose a population consists of $N$ units, $\Omega=\{\upsilon_1,\upsilon_2,\upsilon_3,\ldots,\upsilon_N\}$, and let a first-phase sample of $n_1$ units be drawn, from which the mean of the auxiliary variable is estimated. Then, in the second phase, a sample of $n$ ($n<n_1$) units is drawn, on which both the study and auxiliary variables are measured. To obtain the expressions for the bias and MSE, let us introduce the following terms:
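Notations:
$\xi_0=(\bar{y}-\bar{Y})/\bar{Y}$ is the relative error term for $y$; $\xi_1=(\bar{x}-\bar{X})/\bar{X}$ is the relative error term for $x$ in the second-phase sample; $\xi_2=(\bar{x}_1-\bar{X})/\bar{X}$ is the relative error term for $x$ in the first-phase sample; $C_y$ is the population coefficient of variation (c.v.) of $y$; $C_x$ is the population coefficient of variation (c.v.) of $x$; $\rho$ is the population correlation coefficient between $y$ and $x$; $C_{yx}=\rho C_yC_x$; and $\lambda_1$ and $\lambda$ are the finite population correction (fpc) factors for first- and second-phase sampling, respectively.
Thus $\bar{y}=\bar{Y}(1+\xi_0)$, $\bar{x}=\bar{X}(1+\xi_1)$, and $\bar{x}_1=\bar{X}(1+\xi_2)$, so that $E(\xi_0)=E(\xi_1)=E(\xi_2)=0$, $E(\xi_0^2)=\lambda C_y^2$, $E(\xi_1^2)=\lambda C_x^2$, $E(\xi_2^2)=\lambda_1C_x^2$, $E(\xi_0\xi_1)=\lambda C_{yx}$, $E(\xi_0\xi_2)=\lambda_1C_{yx}$, and $E(\xi_1\xi_2)=\lambda_1C_x^2$, where $\lambda=\frac{N-n}{Nn}$, $\lambda_1=\frac{N-n_1}{n_1N}$, and $R=\bar{Y}/\bar{X}$.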
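The cross-moments involving the first-phase error term depend only on the first-phase factor, which may look surprising at first. A short justification is sketched below, assuming that the second-phase sample is an SRSWOR subsample of the first-phase sample; the symbol $S_1$ for the first-phase sample is introduced here only for this illustration.

```latex
\begin{align*}
E(\bar{x}\mid S_1) &= \bar{x}_1, \\
\operatorname{Cov}(\bar{x},\bar{x}_1)
  &= \operatorname{Cov}\bigl(E(\bar{x}\mid S_1),\,\bar{x}_1\bigr)
   = \operatorname{Var}(\bar{x}_1)
   = \lambda_1\,\bar{X}^2 C_x^2, \\
E(\xi_1\xi_2)
  &= \frac{\operatorname{Cov}(\bar{x},\bar{x}_1)}{\bar{X}^2}
   = \lambda_1 C_x^2 .
\end{align*}
```

The same conditioning argument, applied to $\bar{y}$ and $\bar{x}_1$, gives the cross-moment between the error terms of $\bar{y}$ and $\bar{x}_1$ as $\lambda_1\rho C_yC_x$.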
2. Some existing estimators
Kumar and Bahl [12] proposed the usual ratio estimator of the population mean in two-phase sampling, as follows:
The ratio estimator is typically chosen when there is a positive correlation between the study and auxiliary variables. The mean square error (MSE) of their proposed estimator up to the first order of approximation is given as
where $\psi=\rho_{yx}C_y/C_x$.
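As a worked check, the first-order MSE of the ratio estimator can be derived directly from the moments listed in the Notations paragraph. The sketch below assumes the usual two-phase ratio form $t_R=\bar{y}\,\bar{x}_1/\bar{x}$ (the symbol $t_R$ follows Section 4) and retains only first-order terms in the error variables.

```latex
\begin{align*}
t_R &= \bar{y}\,\frac{\bar{x}_1}{\bar{x}}
     = \bar{Y}(1+\xi_0)(1+\xi_2)(1+\xi_1)^{-1}
     \approx \bar{Y}\,(1+\xi_0+\xi_2-\xi_1), \\
\operatorname{MSE}(t_R)
    &\approx \bar{Y}^2\,E\bigl[(\xi_0+\xi_2-\xi_1)^2\bigr] \\
    &= \bar{Y}^2\bigl[\lambda C_y^2+\lambda_1C_x^2+\lambda C_x^2
       +2\lambda_1\rho C_yC_x-2\lambda\rho C_yC_x-2\lambda_1C_x^2\bigr] \\
    &= \bar{Y}^2\bigl[\lambda C_y^2+(\lambda-\lambda_1)\,C_x^2\,(1-2\psi)\bigr].
\end{align*}
```

The last line uses $\rho C_yC_x=\psi C_x^2$. Under this form the ratio estimator improves on the sample mean $\bar{y}$, whose MSE is $\bar{Y}^2\lambda C_y^2$, exactly when $\psi>1/2$.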
In situations when there is a negative correlation between the study variable and the auxiliary variable, the product estimator is typically preferred over the ratio estimator. In two-phase sampling, the classical product estimator of population mean is given as
The MSE of this estimator up to the first order of approximation is given as
In two-phase sampling, Singh and Vishwakarma [1] suggested the exponential-type ratio and product estimators of the population mean of the study variable as
and
The MSEs of the above estimators up to the first order of approximation are given as
and
Yadav et al. [13] proposed the following exponential ratio and product estimators:
and
The MSEs of the above estimators up to the first order of approximation, at the optimum values $\alpha=\frac{g-2\psi}{g}$ and $\delta=\frac{g+2\psi}{g}$, are given as
and
where $A=g-2\psi$, $B=g+2\psi$, $g=\frac{n}{n_1-n}$, and $\lambda''=\lambda-\lambda_1=\frac{1}{n}-\frac{1}{n_1}$.
The classical unbiased regression estimator is used in two-phase sampling to estimate the population mean when the auxiliary variable and the study variable are correlated. It is defined as
where β is the regression coefficient of the main study variable y regressed on auxiliary variable x. The MSE up to the first order of approximation for the above linear regression estimator is obtained as
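To make the comparison among the classical estimators concrete, the following minimal Python sketch evaluates first-order MSEs from summary parameters. It assumes the standard two-phase forms of the ratio, product, exponential ratio, exponential product, and regression estimators, with MSEs derived from the moments in the Notations paragraph; the function names `fpc_factors` and `first_order_mse` and the input values in the usage note are hypothetical.

```python
def fpc_factors(N, n1, n):
    """Return (lambda, lambda1, lambda'') for second phase, first phase, and their difference."""
    lam = (N - n) / (N * n)
    lam1 = (N - n1) / (N * n1)
    return lam, lam1, lam - lam1


def first_order_mse(Y_bar, Cy, Cx, rho, N, n1, n):
    """First-order MSEs of some classical two-phase estimators (assumed standard forms)."""
    lam, _, lam2 = fpc_factors(N, n1, n)
    psi = rho * Cy / Cx          # psi as defined in the text
    base = lam * Cy ** 2         # relative variance of the second-phase mean of y
    return {
        "mean": Y_bar ** 2 * base,
        "ratio": Y_bar ** 2 * (base + lam2 * Cx ** 2 * (1 - 2 * psi)),
        "product": Y_bar ** 2 * (base + lam2 * Cx ** 2 * (1 + 2 * psi)),
        "exp_ratio": Y_bar ** 2 * (base + lam2 * Cx ** 2 / 4 * (1 - 4 * psi)),
        "exp_product": Y_bar ** 2 * (base + lam2 * Cx ** 2 / 4 * (1 + 4 * psi)),
        "regression": Y_bar ** 2 * Cy ** 2 * (lam - lam2 * rho ** 2),
    }


if __name__ == "__main__":
    # Hypothetical inputs, not taken from any of the populations in the paper.
    for name, value in first_order_mse(100.0, 0.2, 0.2, 0.9, 1000, 200, 50).items():
        print(f"{name:12s} {value:.4f}")
```

At first order the regression estimator is never worse than the ratio estimator under these forms, because the difference of the two MSEs is $\bar{Y}^2\lambda''(C_x-\rho C_y)^2\ge 0$.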
By combining the estimator of Singh and Vishwakarma [1] and the regression estimator with a linear transformation, Ozgul and Cingi [14] proposed the following estimator:
where $\bar{z}_1=u\bar{x}_1+v$ and $\bar{z}=u\bar{x}+v$, and $u$ and $v$ are generalizing constants, meaning that different values of $u$ and $v$ yield different estimators with different MSEs. The minimum MSE up to the first order of approximation for k1=1−2−λ1θ2C2x1+(λ−λ1ρ2) and k2=R[(θ−1)+2−λ1θ2C2x1+(λ−λ1ρ2)(2θ−ψ)] is given as
or
where $\theta=\frac{u\bar{X}}{2(u\bar{X}+v)}$.
3. The proposed class of estimators
The use of auxiliary information commonly leads to improved precision of the estimator. Also, the development of novel estimators for the finite population mean in two-phase sampling is not only a theoretical advancement but also a practical necessity. These novel estimators, developed under existing constraints, are expected to make a substantial contribution to the field of survey sampling by providing more reliable and precise population estimates. Motivated by these objectives, we propose the following two families of estimators under a two-phase sampling scheme using a single auxiliary variable.
3.1. First proposed estimator
Motivated by Muneer et al. [5] and Shabbir et al. [15], we propose the following class of estimators for the population mean:
A number of estimators can be generated from the above estimator by assigning different values of u, v, and ω. Here, k1 and k2 are the minimizing constants whose values are determined by minimizing the MSE, and u, v, and ω, 0≤ω≤1, are the generalizing constants that can assume any suitable value or any known parameter of the population. The bias and MSE of the estimator are given as:
Theorem 1. Under two-phase sampling with a single auxiliary variable, the MSE of the estimator of the population mean defined in Eq (18) is given, up to the first order of approximation, as
3.2. Second proposed estimator
Motivated by Shabbir et al. [15], Gupta and Shabbir [16], and Shabbir and Gupta [17], we propose the following improved class of estimators for a finite population mean in two-phase sampling:
where
Description. In Table 1, estimators 1–3 are product exponential estimators obtained by setting $\omega=0$; they differ in the values of $u$ and $v$, which are taken from known parameter values. Estimators 4–6 are ratio exponential estimators obtained by setting $\omega=1$. Estimators 7–11 are obtained by setting $\omega=\frac{1}{2}$ in combination with different values of $u$ and $v$. The final group of estimators is of a hybrid type, incorporating both ratio and product exponential forms, which makes these estimators efficient in both cases. This list therefore covers a range of possible settings. Many other estimators can be generated; we have listed only a selected few.
The MSE of the proposed estimator is given in the following theorem.
Theorem 2. Under two-phase sampling with a single auxiliary variable, the MSE of the estimator of the population mean defined in Eq (20) is given, up to the first order of approximation, as
where
4. Efficiency comparison
4.1. For the first proposed estimator
Our proposed estimator tpro1 is better than the existing estimators subject to the following conditions:
Condition (ⅰ): By comparing Eqs (2) and (19), MSE(tpro1)<MSE(tR) if
where $\Re_1=\frac{A_{pr}D_{pr}^2+B_{pr}C_{pr}^2-2C_{pr}D_{pr}E_{pr}}{A_{pr}B_{pr}-E_{pr}^2}$.
Condition (ⅱ): By comparing Eqs (4) and (19), MSE(tpro1)<MSE(tP) if
Condition (ⅲ): By comparing Eqs (7) and (19), MSE(tpro1)<MSE(tSVR) if
Condition (ⅳ): By comparing Eqs (8) and (19), MSE(tpro1)<MSE(tSVP) if
Condition (ⅴ): By comparing Eqs (11) and (19), MSE(tpro1)<MSE(tGR) if
Condition (ⅵ): By comparing Eqs (12) and (19), MSE(tpro1)<MSE(tGP) if
Condition (ⅶ): By comparing Eqs (14) and (19), MSE(tpro1)<MSE(tReg) if
Condition (ⅷ): By comparing Eqs (16) and (19), MSE(tpro1)<MSE(tOC) if
where $\nabla_1=\frac{C_y^2(\lambda-\lambda_1\rho^2)(1-\lambda_1\theta^2C_x^2)-\frac{\lambda_1^2\theta^4C_x^4}{4}}{1+C_y^2(\lambda-\lambda_1\rho^2)}$.
4.2. For the second proposed estimator
Our proposed estimator tpro2 is better than the existing estimators subject to the following conditions:
Condition (ⅰ): By comparing Eqs (2) and (21), MSE(tpro2)<MSE(tR) if
where $\Re_2=A_pD_p^2+B_pC_p^2-2C_pD_pE_p+B_p+2B_pC_p+D_p^2-2D_pE_p$ and $\Re_3=A_pB_p-E_p^2+B_p$.
Condition (ⅱ): By comparing Eqs (4) and (21), MSE(tpro2)<MSE(tP) if
Condition (ⅲ): By comparing Eqs (7) and (21), MSE(tpro2)<MSE(tSVR) if
Condition (ⅳ): By comparing Eqs (8) and (21), MSE(tpro2)<MSE(tSVP) if
Condition (ⅴ): By comparing Eqs (11) and (21), MSE(tpro2)<MSE(tGR) if
Condition (ⅵ): By comparing Eqs (12) and (21), MSE(tpro2)<MSE(tGP) if
Condition (ⅶ): By comparing Eqs (14) and (21), MSE(tpro2)<MSE(tReg) if
Condition (ⅷ): By comparing Eqs (16) and (21), MSE(tpro2)<MSE(tOC) if
5. Numerical study
In this section, the MSEs of the proposed estimators, along with other existing estimators for real data sets, are given.
In all these data sets, the following notation is used: $N$, population size; $n_1$, first-phase sample size; $n$, second-phase sample size; $\bar{Y}$, population mean of variable $y$; $\bar{X}$, population mean of variable $x$; $C_y$, population coefficient of variation of $y$; $C_x$, population coefficient of variation of $x$; $\rho$, population correlation coefficient between $y$ and $x$.
Population 1 [18]. Y: The population (in 1000s) in 1930 of 49 US cities. X: The population (in 1000s) in 1920 of the same cities.
Population 2 [19]. Y: The total acreage planted with wheat in 34 communities in 1974. X: The total area planted with wheat (in acres) across 34 villages in 1971.
Population 3 [20]. Y: The number of tube wells. X: The net irrigated area (in hectares) of 69 communities.
Population 4 [20]. Y: The average number of hours spent sleeping. X: The person's age.
Population 5 [21]. Y: The number of agriculture workers in 1971. X: The number of agriculture workers in 1961.
Table 2 shows the MSE values of various estimators of the population mean in two-phase sampling with a single auxiliary variable. R version 4.4.2 was used for all numerical analyses. For each of the two proposed groups of estimators, three choices of the constants (u, v) are considered: the first with no transformation, (1, 0); the second using Cz, (1, Cz); and the third using both Ryz and Cz, (Ryz, Cz). As shown in the table, the MSEs of all estimators in the proposed groups are lower than those of the classical, ratio, and other estimators. Therefore, for these data sets, the proposed groups of estimators of the population mean in two-phase sampling with an auxiliary variable have lower MSEs than the existing estimators considered for the same parameter.
Table 3 shows the percentage relative efficiency (PRE) values of various estimators of the population mean in two-phase sampling with a single auxiliary variable. The same three choices of (u, v) are used for each of the two proposed groups of estimators: no transformation, (1, 0); the choice (1, Cz); and the choice (Ryz, Cz). As shown in the table, all estimators belonging to the proposed groups have higher PREs than the classical, ratio, and other estimators available in the literature. Thus, for these data sets, the proposed groups of estimators of the population mean in two-phase sampling with an auxiliary variable are more efficient than the existing estimators considered for the same parameter.
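For readers who wish to reproduce PRE values from a set of MSEs, a minimal Python sketch follows. It assumes the common definition PRE = 100 × MSE(reference) / MSE(estimator); the choice of reference estimator is an assumption here (the usual sample mean is a frequent choice), and the function name and the numbers in the usage line are hypothetical.

```python
def percent_relative_efficiency(mse_by_estimator, reference):
    """PRE of each estimator relative to `reference`, in percent.

    Assumes PRE = 100 * MSE(reference) / MSE(estimator), so that values
    above 100 indicate an estimator more efficient than the reference.
    """
    ref_mse = mse_by_estimator[reference]
    return {name: 100.0 * ref_mse / mse for name, mse in mse_by_estimator.items()}


if __name__ == "__main__":
    # Hypothetical MSE values, for illustration only.
    print(percent_relative_efficiency({"mean": 8.0, "ratio": 4.0, "regression": 3.2}, "mean"))
```

Because the PRE is a monotone transformation of the MSE for a fixed reference, the ranking of estimators in a PRE table is the reverse of their ranking in the corresponding MSE table.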
6. Simulation study
In two-phase sampling with a single auxiliary variable, the sample is drawn in two phases. In the first phase, a sample is selected from the population and the auxiliary variable is observed. In the second phase, a subsample is selected from the first-phase sample. This design is commonly used to reduce sampling costs and increase efficiency.
Several estimators, including regression, ratio, and difference estimators, have been proposed to estimate the population mean under this design. Simulations were performed to assess estimator performance. The general procedure for simulating estimators of the finite population mean in two-phase sampling with a single auxiliary variable is presented below.
Algorithm: Simulation for estimator evaluation
ⅰ. Initialization of parameters
Define N,
Define auxiliary variables and their distribution functions,
Define population mean and variance covariance matrix.
ⅱ. Generate population
Based on the above parameters we generate a data set of size N.
ⅲ. Draw a first-phase sample and sample statistics
From this dataset randomly select an initial sample S1 called first-phase sample.
Calculate means, correlation coefficients, coefficient of variations, and other statistics for both variables y and x.
ⅳ. Draw a second-phase sample
Draw a second-phase sample S2 from the first-phase sample, so that S2 ⊂ S1.
ⅴ. Calculate population estimators' values
We calculate the values for the estimators of the population mean from this sample utilizing the statistics obtained from the first-phase sample as auxiliary information.
Store these values of estimators for performance evaluation.
ⅵ. Repeat steps iii-v ℜ times to obtain the distribution of the estimated population mean.
ⅶ. Evaluate estimators' performance
Calculate the empirical distribution of each estimator over the ℜ repetitions,
and obtain the MSE of each estimator.
ⅷ. Analyze results
Repeat steps ⅲ-ⅶ for different sample sizes.
Compare MSEs across different scenarios to evaluate estimators' performance.
By conducting simulations, one can compare the performance of different estimators and choose the estimator that performs best under various scenarios.
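The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the tables: the population is bivariate normal, only four benchmark estimators are included (sample mean, ratio, exponential ratio, and regression with the sample slope from the second phase), and all function names, parameter values, and the seed are hypothetical.

```python
import math
import random


def make_population(N, mean_x, mean_y, sd_x, sd_y, rho, rng):
    """Steps i-ii: generate N correlated (x, y) pairs from a bivariate normal model."""
    population = []
    for _ in range(N):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        population.append((x, y))
    return population


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def two_phase_estimates(first_phase, second_phase):
    """Step v: only x is used from the first phase; x and y from the second phase."""
    x1_bar = mean(x for x, _ in first_phase)
    x_bar = mean(x for x, _ in second_phase)
    y_bar = mean(y for _, y in second_phase)
    # Sample regression slope of y on x, estimated from the second-phase sample.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in second_phase)
    sxx = sum((x - x_bar) ** 2 for x, _ in second_phase)
    slope = sxy / sxx
    return {
        "mean": y_bar,
        "ratio": y_bar * x1_bar / x_bar,
        "exp_ratio": y_bar * math.exp((x1_bar - x_bar) / (x1_bar + x_bar)),
        "regression": y_bar + slope * (x1_bar - x_bar),
    }


def simulate_mse(population, n1, n, reps, rng):
    """Steps iii-vii: empirical MSE of each estimator over `reps` two-phase samples."""
    true_mean = mean(y for _, y in population)
    totals = {}
    for _ in range(reps):
        first_phase = rng.sample(population, n1)   # SRSWOR from the population
        second_phase = rng.sample(first_phase, n)  # SRSWOR subsample of S1
        for name, value in two_phase_estimates(first_phase, second_phase).items():
            totals[name] = totals.get(name, 0.0) + (value - true_mean) ** 2
    return {name: total / reps for name, total in totals.items()}


if __name__ == "__main__":
    rng = random.Random(2025)
    pop = make_population(1000, 50.0, 100.0, 10.0, 20.0, 0.9, rng)
    for n in (30, 50, 80):  # step viii: vary the second-phase sample size
        print(n, simulate_mse(pop, n1=200, n=n, reps=2000, rng=rng))
```

Further estimators can be evaluated by adding entries to the dictionary returned by `two_phase_estimates`; the empirical MSEs can then be converted to PREs relative to a chosen reference estimator.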
Table 4 displays the MSE values of various estimators for the finite population mean in two-phase sampling with a single auxiliary variable on the simulated data. In this study, we introduced two novel families of estimators, tpro1 and tpro2, and presented three special cases for each of these estimators.
An analysis of the results indicates that our proposed estimators have lower MSEs than all the estimators selected from the existing literature for the same parameter in similar scenarios. Moreover, for both of our proposed families, the estimators with u = Ryx and v = Cx consistently have the smallest MSEs across all selected sample sizes. In addition, the MSE values of the proposed estimators remain relatively stable across varying sample sizes.
Based on these findings, we conclude that, in these simulations, our proposed families of estimators of the population mean in two-phase sampling are more efficient than traditional estimators such as ratio, product, and other similar estimators, and that this advantage is stable across sample sizes.
Table 5 displays the PRE values of the estimators of the finite population mean in two-phase sampling with a single auxiliary variable, based on simulated data. Numerous estimators can be obtained from the two proposed families, tpro1 and tpro2, by assigning different values to the generalizing constants u, v, and ω; we report only three cases for each family. The results show that the proposed estimators outperform all the existing estimators considered for the same parameter under comparable circumstances. Furthermore, the estimators with u = ρyx and v = Cx from both proposed families have smaller MSEs, and hence higher PREs, across all chosen sample sizes. The PRE values are also stable across varying sample sizes, indicating that the proposed families remain efficient relative to the conventional ratio, product, and other estimators of the population mean in two-phase sampling.
7. Conclusions
The use of auxiliary information is highly effective in improving efficiency; however, such information is not always available during the design and estimation processes, or it can be prohibitively expensive. Two-phase sampling procedures are frequently employed to overcome this challenge. In this approach, a first-phase sample is drawn from the population, and auxiliary information is collected from this sample. In the second phase, a smaller sample is drawn from the first-phase sample, and both the study and auxiliary variables are measured in this subsample. Various estimators, such as ratio and product estimators, are commonly used to estimate the finite population mean in two-phase sampling with a single auxiliary variable. Inspired by the work of [7,8,15], we have proposed two classes of estimators for the population mean and derived their expressions for bias and mean squared error (MSE) up to the first order of approximation. We identified the theoretical conditions under which the proposed estimators are more efficient, in the sense of having smaller MSE, than some existing estimators. The proposed estimator families were compared with existing mean estimators on both real and simulated data based on their MSE and percentage relative efficiency (PRE).
The results of our comparison strongly support the claim that the proposed estimator families are significantly more efficient than existing methods for estimating the finite population mean in a two-phase sampling scenario.
Author contributions
Khazan Sher: Conceptualization, project administration, writing–original draft, writing–review and editing; Muhammad Ameeq, Sidra Naz, Basem A. Alkhaleel: Conceptualization, project administration, investigation, writing–original draft, writing–review and editing; Muhammad Muneeb Hassan, Olyan Albalawi: Conceptualization, project administration, investigation, writing–original draft, writing–review and editing. All authors have read and agreed to the published version of the manuscript.
Use of Generative-AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
Researchers Supporting Project number (RSPD2025R630), King Saud University, Riyadh, Saudi Arabia.
Conflict of interest
The authors declare that there are no conflicts of interest.
Appendix
Proof 1. To obtain the MSE of the estimator given in Eq (18), we can express Eq (18) in terms of error as follows:
or
where $\eta=\frac{u\bar{X}}{u\bar{X}+v}$.
Now, expanding the exponential term as a series and applying a Taylor series expansion, we have
or
Taking expectation on both sides of Eq (41), the bias of the estimator is given as
Now by taking the square of Eq (42), we get the following expression:
Taking expectations on both sides of Eq (43), we get the MSE expression as
where $\Delta_1=\vartheta_1^2+2\vartheta_3$, $\Delta_2=\vartheta_1^2+2\vartheta_4$, and $\Delta_3=2(\vartheta_1^2+\vartheta_2)$.
Also, the simplified form is given as
To find the values of $k_1$ and $k_2$, we rewrite Eq (44) in the following simplified form:
Now differentiating Eq (45) w.r.t. k1 and k2 and equating to zero, we get the following two equations:
So, we obtain
and
Solving Eqs (46) and (47) simultaneously, we obtain the following optimum values: $k_1=\frac{B_{pr}C_{pr}-D_{pr}E_{pr}}{A_{pr}B_{pr}-E_{pr}^2}$ and $k_2=\frac{\bar{Y}(A_{pr}D_{pr}-C_{pr}E_{pr})}{A_{pr}B_{pr}-E_{pr}^2}$. With these values, the minimum MSE of the estimator takes the following form:
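The optimum values above can be checked numerically. The Python sketch below assumes that the relative MSE is a quadratic of the form $1+k_1^2A+k_2'^2B-2k_1C-2k_2'D+2k_1k_2'E$ with $k_2'=k_2/\bar{Y}$; this form and its sign conventions are assumptions chosen to be consistent with the stated optima and with $\Re_1$ in Section 4.1, and the coefficient values in the usage lines are hypothetical.

```python
def quadratic(k1, k2s, A, B, C, D, E):
    """Assumed relative MSE as a quadratic in k1 and the scaled constant k2s = k2 / Y_bar."""
    return 1 + k1 ** 2 * A + k2s ** 2 * B - 2 * k1 * C - 2 * k2s * D + 2 * k1 * k2s * E


def optimum_constants(A, B, C, D, E):
    """Closed-form minimiser of `quadratic` and the resulting reduction term.

    Returns (k1, k2s, reduction), where the minimum value equals 1 - reduction.
    """
    det = A * B - E ** 2
    if A <= 0 or det <= 0:
        raise ValueError("the quadratic form must be positive definite")
    k1 = (B * C - D * E) / det
    k2s = (A * D - C * E) / det
    reduction = (A * D ** 2 + B * C ** 2 - 2 * C * D * E) / det
    return k1, k2s, reduction


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    coeffs = (1.1, 0.5, 1.0, 0.1, 0.2)
    k1, k2s, reduction = optimum_constants(*coeffs)
    print(k1, k2s, quadratic(k1, k2s, *coeffs), 1 - reduction)
```

The two printed values at the end agree up to rounding, which illustrates why the minimum can be written in terms of a single ratio such as $\Re_1$.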
Proof 2. The MSE of the proposed estimator given in Eq (20) and its bias are obtained as follows:
or
where
The difference equation of the proposed estimator is given as follows:
After taking the expectation of Eq (51), we obtain the following bias expression
Taking the square on both sides of Eq (52) and simplifying, we have
Taking the expectation on both sides of the above Eq (53), we obtain the following MSE equation:
or
Differentiating Eq (55) with respect to the constants $k_3$ and $k_4$ and equating the derivatives to zero gives their optimum values:
So, we obtain
and
Solving both Eqs (56) and (57) simultaneously, we obtain the following values of the constants k3 and k4
Incorporating these optimum values of the constants, the minimum MSE of the proposed estimator is given as
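As a sketch of the algebra behind this step, suppose the relative MSE can be written as the quadratic $Q$ below in $k_3$ and a suitably scaled constant $k_4'$. This form and the signs of its terms are assumptions, chosen because the resulting minimum reproduces $\Re_2$ and $\Re_3$ as defined in Section 4.2.

```latex
\begin{align*}
Q(k_3,k_4') &= 1+k_3^2(1+A_p)+k_4'^2B_p-2k_3(1+C_p)-2k_4'D_p+2k_3k_4'E_p, \\
\frac{\partial Q}{\partial k_3}=0 &\;\Rightarrow\; (1+A_p)\,k_3+E_p\,k_4'=1+C_p, \\
\frac{\partial Q}{\partial k_4'}=0 &\;\Rightarrow\; E_p\,k_3+B_p\,k_4'=D_p, \\
k_3 &= \frac{B_p(1+C_p)-D_pE_p}{\Re_3}, \qquad
k_4' = \frac{(1+A_p)D_p-(1+C_p)E_p}{\Re_3}, \\
Q_{\min} &= 1-\bigl[(1+C_p)\,k_3+D_p\,k_4'\bigr] = 1-\frac{\Re_2}{\Re_3},
\end{align*}
```

with $\Re_3=(1+A_p)B_p-E_p^2=A_pB_p-E_p^2+B_p$ and $\Re_2=(1+A_p)D_p^2+B_p(1+C_p)^2-2(1+C_p)D_pE_p=A_pD_p^2+B_pC_p^2-2C_pD_pE_p+B_p+2B_pC_p+D_p^2-2D_pE_p$.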