1. Introduction
It is standard practice in sampling theory to use auxiliary variables alongside the study variable, exploiting their relationship to improve the design and increase the efficiency of the estimator. In practice, however, information about the auxiliary variables is sometimes unavailable before a survey is conducted; in such instances, a two-phase sampling procedure is preferable. Two-phase sampling, sometimes referred to as double sampling, selects a sample from a population in two steps. Because it is an economical strategy, it is frequently employed in sample surveys when supplementary data are not available ahead of time. A brief summary of two-phase sampling was first given in [1]. The topic was not explored further until the work of [2]. Owing to its low-cost variable-screening properties, two-phase sampling has received considerable interest in recent years. For estimating the finite population mean under two-phase sampling schemes, different estimators were proposed in [3,4,5]. For estimating the finite population variance, different estimators were suggested in [7,8]. For more information, see [9,10,11] and references therein.
Because variation occurs naturally, estimating the finite population variance is an important problem. The use of auxiliary information to estimate the population variance was first introduced by [12] and then expanded upon by [13]. Using supplementary information in an informed way can improve estimator performance. To estimate the population variance, [14] proposed exponential estimators based on ratios and products. Using different transformations, [15,16,17] introduced new estimators to improve variance estimation. Under simple random sampling and stratified random sampling, different families of estimators were obtained by [18,19,20]. For more details about estimators and methods for estimating the finite population variance, we refer to [21,22,23].
Sample survey data may contain unusual observations. When the sample contains outlier values, the results may be distorted. Several researchers have therefore focused on outlier values and presented various methods to estimate population characteristics. The researchers in [24] used a linear transformation to obtain two estimators based on the minimum and maximum observations of the auxiliary variable. This line of work was not investigated further until [25], which considered numerous finite population mean estimators together with the idea of using extreme values in them. For estimating the finite population mean, [26,27] introduced different transformation methods to handle outliers. [28] used stratified random sampling to improve the estimate of the finite population mean under extreme values. [29] provided effective estimators of the population variance using extreme value transformations. The work [30] proposed novel estimators that use extreme values to estimate the population variance with minimum mean squared error (MSE). [31] proposed double exponential ratio estimators that use extreme values of the auxiliary variable and evaluated their effectiveness in estimating the population variance. To improve estimator accuracy, [32] developed efficient estimators that leverage auxiliary variables under simple random sampling. For further information, see [33,34].
Several important considerations motivated the development of a new method to estimate the finite population variance:
● Traditional estimators of the finite population variance frequently neglect extreme values (outliers) and the ranks of auxiliary variables. Outliers are difficult to handle and can lead to skewed conclusions or an inflated MSE. The inefficiency of these estimators under stratified two-phase sampling designs highlights the need for a more efficient approach that addresses these problems.
● Existing estimators often struggle with stratified two-phase sampling due to its complicated data structure. These issues emphasize the need for more robust and efficient estimators.
● In most cases, two-phase sampling is more economical than one-phase sampling, particularly when dealing with large populations. It lowers total expenditure by enabling researchers to gather preliminary data before selecting a smaller second sample.
● Two-phase sampling enables researchers to choose clusters or strata that reflect the whole population, which helps guarantee that varied sub-populations are effectively represented.
● Two-phase sampling is a useful method in a variety of research situations because it offers more accurate representation and control of variability along with cost savings, enhanced precision, and flexibility.
In this article, our main objective is to make proper use of the information about the outlier values of the auxiliary variable as supplementary information to increase the accuracy of the proposed class of estimators. Outlier information is often removed from sample data, and classical estimators that ignore it generally lose efficiency, with a larger MSE. When there is a relationship between the two variables, the ranks of the auxiliary variable are also related to the study variable. Consequently, these ranks provide a useful tool for improving the accuracy of the estimators. Motivated by [29,30,31,32], we apply the transformation technique to provide a new class of estimators that uses the ranks of the auxiliary variable and the known information on its outlier values to estimate the finite population variance in two-phase stratified sampling. The suggested method is particularly useful in economic surveys, public health examinations, and environmental evaluations, where similar sampling strategies are often used. The new estimators are well suited to disciplines such as market research and agricultural surveys that frequently encounter extreme values, as they can efficiently include outlier information without distorting results.
This article is organized as follows: In Section 2, we introduce the foundational concepts and notations. In Section 3, we discuss various established estimators. Our proposed class of estimators is detailed in Section 4. A theoretical comparison is presented in Section 5. In Section 6, we conduct simulations on six different artificial populations with varying probability distributions to evaluate the theoretical results discussed in Section 5. This section also provides numerical examples to validate our theoretical findings. Finally, in Section 7, we offer a discussion of the results and suggestions for future research.
2. Concepts and notations
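Let us consider a finite population of size N units. This population is divided into L strata, the hth of which contains Nh units (h=1,2,…,L), with the property that
$$\sum_{h=1}^{L} N_h = N.$$
Let yhi, xhi, and rhi be the values of the study variable (Y), the auxiliary variable (X), and the ranks of the auxiliary variable (R) for the ith unit (i=1,2,…,Nh) of the hth stratum, respectively. We define the population variances for these variables in the hth stratum as
$$S_{yh}^2 = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(y_{hi}-\bar{Y}_h\right)^2,\quad S_{xh}^2 = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(x_{hi}-\bar{X}_h\right)^2,\quad S_{rh}^2 = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(r_{hi}-\bar{R}_h\right)^2,$$
where
$$\bar{Y}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} y_{hi},\quad \bar{X}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} x_{hi},$$
and
$$\bar{R}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} r_{hi}$$
denote the population means of the study variable (Y), the auxiliary variable (X), and the ranks of the auxiliary variable (R) in the hth stratum, which correspond to the overall population means
$$\bar{Y} = \sum_{h=1}^{L} W_h \bar{Y}_h,\quad \bar{X} = \sum_{h=1}^{L} W_h \bar{X}_h,\quad \bar{R} = \sum_{h=1}^{L} W_h \bar{R}_h,$$
respectively, where Wh is the stratum weight, defined by
$$W_h = \frac{N_h}{N}.$$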
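The population coefficients of variation in the hth stratum are defined as
$$C_{yh} = \frac{S_{yh}}{\bar{Y}_h},\quad C_{xh} = \frac{S_{xh}}{\bar{X}_h},$$
and
$$C_{rh} = \frac{S_{rh}}{\bar{R}_h},$$
where Syh, Sxh, and Srh are the population standard deviations of (Y,X,R) in the hth stratum, respectively.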
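Furthermore, define the population correlation coefficients between (Y,X), (Y,R), and (X,R) in the hth stratum as follows:
$$\rho_{yxh} = \frac{S_{yxh}}{S_{yh}S_{xh}},\quad \rho_{yrh} = \frac{S_{yrh}}{S_{yh}S_{rh}},\quad \rho_{xrh} = \frac{S_{xrh}}{S_{xh}S_{rh}},$$
where
$$S_{yxh} = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(y_{hi}-\bar{Y}_h\right)\left(x_{hi}-\bar{X}_h\right),\quad S_{yrh} = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(y_{hi}-\bar{Y}_h\right)\left(r_{hi}-\bar{R}_h\right),$$
and
$$S_{xrh} = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(x_{hi}-\bar{X}_h\right)\left(r_{hi}-\bar{R}_h\right)$$
are the population co-variances, respectively.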
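As an illustration of the stratum-level quantities defined in this section, the following Python sketch computes the variances, coefficients of variation, and correlation coefficients of (Y, X, R) for one stratum. It is only a sketch under stated assumptions: the function names (ranks, covariance, stratum_summary) are hypothetical, the divisor Nh − 1 is assumed for variances and covariances, and tied values of X are assumed to receive their average rank.

```python
import statistics


def ranks(values):
    """Return 1-based ranks of values; ties get the average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def covariance(a, b):
    """Covariance of two equal-length lists with divisor N_h - 1 (assumed)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def stratum_summary(y, x):
    """Summaries of (Y, X, R) for a single stratum, where R holds the ranks of X."""
    r = ranks(x)
    s_y, s_x, s_r = statistics.stdev(y), statistics.stdev(x), statistics.stdev(r)
    return {
        "var_y": statistics.variance(y),
        "var_x": statistics.variance(x),
        "var_r": statistics.variance(r),
        "cv_y": s_y / statistics.mean(y),
        "cv_x": s_x / statistics.mean(x),
        "cv_r": s_r / statistics.mean(r),
        "rho_yx": covariance(y, x) / (s_y * s_x),
        "rho_yr": covariance(y, r) / (s_y * s_r),
        "rho_xr": covariance(x, r) / (s_x * s_r),
        "x_min": min(x),
        "x_max": max(x),
    }
```

Calling stratum_summary once per stratum and weighting the stratum means by Nh/N gives the overall population means.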
In this paper, we provide a set of estimators to estimate the finite population variance S2y of Y in the presence of the auxiliary variable X. The two-phase sampling scheme is defined as follows:
(1) In the first phase, a sample of size n′h (n′h<Nh) is drawn from the hth stratum in order to estimate the population variance S2xh.
(2) In the second phase, a sample of size nh (nh<n′h) is drawn from the first-phase sample in order to observe both y and x.
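The two-phase scheme can be sketched for a single stratum as follows. This is a minimal Python illustration, not the authors' implementation: the function name two_phase_sample is hypothetical, and it assumes both phases use simple random sampling without replacement, with the second-phase sample drawn from the first-phase sample.

```python
import random


def two_phase_sample(stratum_units, n_first, n_second, rng):
    """Draw a two-phase sample of unit labels from one stratum.

    stratum_units: list of unit labels in the stratum (size N_h)
    n_first:       first-phase sample size, must satisfy n_first < N_h
    n_second:      second-phase sample size, must satisfy n_second < n_first
    rng:           a random.Random instance, passed in for reproducibility
    """
    if not (0 < n_second < n_first < len(stratum_units)):
        raise ValueError("sample sizes must satisfy 0 < n_second < n_first < N_h")
    # Phase 1: sample without replacement from the stratum; only x is observed.
    first_phase = rng.sample(stratum_units, n_first)
    # Phase 2: subsample without replacement from phase 1; y and x are observed.
    second_phase = rng.sample(first_phase, n_second)
    return first_phase, second_phase
```

For example, two_phase_sample(list(range(1000)), 400, 200, random.Random(1)) returns 400 first-phase labels and a subset of 200 of them.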
We define the following concepts in order to calculate the biases and mean square errors for different estimators:
such that
for i=0,1,2,3,4.
where
Also
where
Here,
are the population coefficients of kurtosis.
3. Literature review
Next, we review existing estimators of the finite population variance, which are later compared with the estimators in our proposed class.
The variance of the usual estimator
in stratified random sampling is defined as follows:
The unbiased estimator ˆS2T1 of S2yst is defined as
The usual variance estimator of ˆS2T1 for population variance is given by
A ratio estimator for population variance ˆS2T2, proposed by [13], is given by
The following equations represent the bias and MSE of ˆS2T2:
and
According to [35], the linear regression estimator ˆS2T3 is defined as
where
is the sample regression coefficient.
The following equation represents the MSE of ˆS2T3:
where
An exponential ratio-type estimator ˆS2T4, presented by [14], is defined as follows:
The following equations represent the bias and MSE of ˆS2T4:
and
By employing the kurtosis of an auxiliary variable, [15] proposed a ratio-type estimator ˆS2T5, which is defined as
The following equations represent the bias and MSE of ˆS2T5:
and
where
A class of ratio estimators is given in [17]; its members are defined as
and
The following equations represent the bias and MSE of ˆS2Ti:
and
where
4. Proposed class of estimators
In this section, we present an improved class of estimators inspired by prior works [30,31,32]. These estimators employ minimum and maximum values of auxiliary variables, along with their ranks, in two-phase sampling to estimate the variance of the finite population. The suggested estimator is defined as
where (θih,i=1,2) are known constants taking the value 1 or 2, and (γih,i=1,2,3,4) are parameters of the auxiliary variable. The minimum and maximum values (outliers) of the auxiliary variable are denoted by (xmh,xMh), while the minimum and maximum values (outliers) of the ranks of the auxiliary variable are denoted by (Rmh,RMh). The known values of γ1h,γ2h are given in Table 1,
and
Various members of the suggested class of estimators can be derived from (4.1); they are listed in Table 1.
where
To study the properties of the proposed class of estimators, we rewrite (4.1) in terms of errors to obtain the bias and the MSE of ˆS2Q, i.e.,
where
Applying a Taylor series expansion to the first order of approximation, we obtain
Using (4.3), the bias of ˆS2Q is given by
After some simplification, we get
where
Squaring Eq (4.3) and taking the expectation gives a first-order approximation of the MSE, which is represented by the following equation:
After simplification, we get
5. Mathematical comparison
In this section, the proposed class of estimators ˆS2Q is compared with the existing estimators ˆS2T1, ˆS2T2, ˆS2T3, ˆS2T4, ˆS2T5, and ˆS2Ti.
Condition (ⅰ): By (3.1) and (4.5)
if
For
that is,
Similarly
that is,
If condition (5.1) or (5.2) holds, the suggested estimator ˆS2Q is more efficient than ˆS2T1.
Condition (ⅱ): By (3.4) and (4.5)
if
For
that is,
Similarly,
that is,
If condition (5.3) or (5.4) holds, the suggested estimator ˆS2Q is more efficient than ˆS2T2.
Condition (ⅲ): By (3.6) and (4.5)
if
For
that is,
Similarly,
that is,
If condition (5.5) or (5.6) holds, the suggested estimator ˆS2Q is more efficient than ˆS2T3.
Condition (ⅳ): By (3.9) and (4.5)
if
For
that is,
Similarly,
that is,
If condition (5.7) or (5.8) holds, the suggested estimator ˆS2Q is more efficient than ˆS2T4.
Condition (ⅴ): By (3.12) and (4.5)
if
For
that is,
Similarly,
that is,
If condition (5.9) or (5.10) holds, the suggested estimator ˆS2Q is more efficient than ˆS2T5.
Condition (ⅵ): By (3.17) and (4.5)
if
For
that is,
Similarly,
that is,
If condition (5.11) or (5.12) holds, the suggested estimator ˆS2Q is more efficient than ˆS2Ti.
6. Numerical comparison
In this section, we examine the performance of the proposed class of estimators relative to other estimators using percent relative efficiencies (PREs). The examination is carried out using both simulated data and three real data sets.
6.1. Simulation study
To confirm the theoretical results reported in Section 5, we follow the methods of [30,31,32] and undertake a simulation study. The goal is to evaluate the performance of the suggested class of estimators, which uses the known minimum and maximum values of the auxiliary variable as well as its ranks, in the context of two-phase stratified sampling. The following probability distributions are used to generate six distinct artificial populations for the auxiliary variable X:
● Population 1: X∼Exponential (1);
● Population 2: X∼Exponential (3);
● Population 3: X∼Uniform (1,3);
● Population 4: X∼Uniform (1,2);
● Population 5: X∼Gamma (1,4);
● Population 6: X∼Gamma (2,5).
The variable of interest, Y, is computed as
$$Y = \rho_{yx} X + e,$$
where $\rho_{yx}$ indicates the correlation coefficient between the study and the auxiliary variables, and e∼N(0,1) signifies the error term.
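The population-generation step can be illustrated with a short Python sketch (Python is used here only for illustration; the study itself was carried out in R). Several points are assumptions and are not stated in the text: the study variable is taken to have the linear form Y = ρX + e, the exponential parameter is read as a rate, the gamma parameters are read as (shape, scale), and the value of ρ passed by the caller is a placeholder. The names POPULATIONS and generate_population are hypothetical.

```python
import random

# One sampler per artificial population for the auxiliary variable X.
# Assumed parameterizations: Exponential(rate), Uniform(low, high), Gamma(shape, scale).
POPULATIONS = {
    1: lambda rng: rng.expovariate(1.0),
    2: lambda rng: rng.expovariate(3.0),
    3: lambda rng: rng.uniform(1.0, 3.0),
    4: lambda rng: rng.uniform(1.0, 2.0),
    5: lambda rng: rng.gammavariate(1.0, 4.0),
    6: lambda rng: rng.gammavariate(2.0, 5.0),
}


def generate_population(population_id, size, rho, seed=0):
    """Generate (x, y) with y = rho * x + e and e ~ N(0, 1) (assumed linear form)."""
    rng = random.Random(seed)
    draw_x = POPULATIONS[population_id]
    x = [draw_x(rng) for _ in range(size)]
    y = [rho * x_i + rng.gauss(0.0, 1.0) for x_i in x]
    return x, y
```

For example, generate_population(3, 2000, 0.8, seed=1) returns a population of 2000 units with X drawn from Uniform(1, 3); the value 0.8 is purely illustrative.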
To compute the PREs, we used the following algorithm in R:
Step 1: We first use the probability distributions mentioned above to generate a population of size 2000. In order to compute the values of the existing and suggested estimators, this population is split into two strata.
Step 2: Draw a first-phase sample of size n′h from the stratum of size Nh using simple random sampling without replacement (SRSWOR).
Step 3: Using SRSWOR, draw the second-phase sample of size nh from the first-phase sample.
Step 4: We calculate the population total and the extreme values of the auxiliary variable from the above steps.
Step 5: For each population, we use SRSWOR to draw samples of different sizes from each stratum. The sample sizes are specified as 20%, 30%, and 40%.
Step 6: Obtain the PRE values for each sample size using all of the estimators presented in this article. This step ensures that the relative efficiency of each estimator is evaluated across different sample sizes.
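Step 7: Steps 5 and 6 are then repeated 50,000 times to ensure the robustness of the results. The outcomes for the artificial populations are presented in Table 2, which provides a comprehensive analysis of the estimators' performance under simulated conditions.
Step 8: Furthermore, obtain the MSEs and PREs for each estimator over all replications using the following formulas:
$$\mathrm{MSE}\left(\hat{S}^2_{l}\right) = \frac{1}{50000}\sum_{k=1}^{50000}\left(\hat{S}^2_{l(k)} - S_y^2\right)^2$$
and
$$\mathrm{PRE}\left(\hat{S}^2_{l}\right) = \frac{\mathrm{MSE}\left(\hat{S}^2_{T_1}\right)}{\mathrm{MSE}\left(\hat{S}^2_{l}\right)} \times 100,$$
where l is one of T1,T2,T3,T4,T5,T6,T7,T8,TQk(k=1,2,…,8).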
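The replication loop of the steps above can be sketched in Python as follows (for illustration only; the study used R). To keep the example short it treats a single stratum and compares just two estimators: the usual sample variance, and a two-phase ratio-type estimator of the assumed form s2y·s′2x/s2x, where s′2x is the first-phase sample variance of x. This ratio form is an assumption in the spirit of [13], not a transcription of the estimators in this article, and all function names are hypothetical.

```python
import random
import statistics


def usual_estimator(y_second, x_second, x_first):
    """Sample variance of y from the second-phase sample."""
    return statistics.variance(y_second)


def ratio_estimator(y_second, x_second, x_first):
    """Assumed two-phase ratio-type form: s_y^2 * s'_x^2 / s_x^2."""
    return (statistics.variance(y_second) * statistics.variance(x_first)
            / statistics.variance(x_second))


def monte_carlo_mse(estimator, x, y, n_first, n_second, replications, seed=0):
    """Average squared error of an estimator of S_y^2 over repeated two-phase samples."""
    rng = random.Random(seed)
    target = statistics.variance(y)  # population variance with divisor N - 1
    labels = list(range(len(y)))
    total = 0.0
    for _ in range(replications):
        first = rng.sample(labels, n_first)    # phase 1: SRSWOR from the population
        second = rng.sample(first, n_second)   # phase 2: SRSWOR from phase 1
        estimate = estimator([y[i] for i in second],
                             [x[i] for i in second],
                             [x[i] for i in first])
        total += (estimate - target) ** 2
    return total / replications


def percent_relative_efficiency(mse_reference, mse_candidate):
    """PRE of a candidate estimator relative to the reference estimator."""
    return 100.0 * mse_reference / mse_candidate
```

Passing the same seed to monte_carlo_mse for both estimators evaluates them on identical samples, which reduces Monte Carlo noise in the resulting PRE. A PRE above 100 indicates that the candidate has a smaller simulated MSE than the reference.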
6.2. Numerical examples
To evaluate the effectiveness of the suggested estimators, we examine the PREs of several estimators on three real data sets. The data sets are described below, and their summary statistics are given in Tables 3–5.
● Data 1. (Source: [36, p.226])
Y: The employment levels recorded by the different departments for 2012, which represents the overall number of workers.
X: The total number of factories that these departments officially registered in 2012, which gives information on industrial activity.
R: The rankings assigned to each department based on the total number of factories they registered in 2012, offering a comparative view of industrial engagement across departments.
Two distinct groups have been created from the data set:
Group 1: The Gujranwala, Rawalpindi, Sargodha, and Lahore divisions are included in this group; they all contribute to the examination of employment and industrial registration.
Group 2: This group represents another aspect of the information for comparison analysis and is made up of the divisions of Bahawalpur, Faisalabad, Multan, Sahiwal, and Khan.
● Data 2. (Source: [36, p.135])
Y: The total number of students enrolled in educational institutions in 2012.
X: The total number of government-funded schools in 2012.
R: The ranks assigned according to the number of government-funded schools in 2012.
Two distinct groups have been created from the data set:
Group 1: The Gujranwala, Rawalpindi, Sargodha, and Lahore divisions are included in this group; they all contribute to the examination of student enrolment and government-funded schools.
Group 2: This group represents another aspect of the information for comparison analysis and is made up of the divisions of Bahawalpur, Faisalabad, Multan, Sahiwal, and Khan.
● Data 3. (Source: [37, p.24])
Y: The expenses incurred on food by the family, directly related to their employment.
X: The total weekly income earned by the family, reflecting their financial resources for that period.
R: The ranking of families based on their weekly income, providing a comparative measure of their earnings.
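For efficiency comparisons, we use the following formula:
$$\mathrm{PRE}\left(\hat{S}^2_{l}\right) = \frac{\mathrm{MSE}\left(\hat{S}^2_{T_1}\right)}{\mathrm{MSE}\left(\hat{S}^2_{l}\right)} \times 100,$$
where l is one of T1,T2,T3,T4,T5,T6,T7,T8,TQk(k=1,2,…,8).
Additionally, Table 6 presents a summary of the findings for the real data sets.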
7. Conclusions
A class of efficient estimators for the finite population variance was introduced in this article. These estimators account for both the ranks and the extreme values of the auxiliary variable. The theoretical conditions derived in Section 5 indicate when the suggested class of estimators is more efficient than the existing ones. To verify these conditions, we conducted a simulation study and examined three empirical data sets. The outcomes, displayed in Table 2, demonstrate that the suggested class of estimators consistently performs better in terms of PREs than the other existing estimators. The theoretical results in Section 5 are further confirmed by the empirical results shown in Table 6. We conclude that, in comparison to the other estimators under consideration, the suggested class of estimators ˆS2Qi (i=1,2,…,8) exhibits superior efficiency on both simulated and empirical data. Because it has the lowest MSE among the suggested estimators, ˆS2Q2 is particularly preferable.
Some advantages of this study in practical applications are given below:
● Improved accuracy and efficiency: By using the extreme values and ranks of auxiliary variables, the novel approach improves precision and efficiency when estimating the population variance. The suggested estimators outperform previous approaches, achieving PRE values of up to 385.467 in the simulation experiments. This improved performance is especially valuable in real-world applications where survey data may contain outliers.
● Applicability in stratified two-phase sampling: The suggested approach is specifically designed for stratified two-phase sampling, which is common in large-scale surveys where supplementary information may be unavailable until later stages. This makes the approach particularly useful in economic surveys, public health examinations, and environmental evaluations, where similar sampling strategies are often used.
● Handling of outliers in practical contexts: The new estimators are well suited to disciplines such as market research and agricultural surveys that frequently encounter extreme values, as they can efficiently include outlier information without distorting results.
Benchmark analysis
For this study, a thorough benchmark analysis was conducted using the procedures listed below:
Selection of competing methods:
● Identify the estimators that are employed in stratified two-phase sampling to estimate the finite population variance. Examples include regression-based estimators, exponential ratio estimators, and conventional variance estimators.
● For a broad basis of comparison, use the most widely used techniques from the literature review, such as those put forward by Isaki (1983) [13], Bahl and Tuteja (1991) [14], and Upadhyaya and Singh (1999) [15].
Performance metrics:
● PRE, which shows the improvement in effectiveness over a standard approach, should be the main tool used to assess how well various estimators perform.
● To examine the estimators' performance as a whole, take into account other measures, including bias, adaptability to outliers, and MSE.
● Analyze the computational efficiency of the suggested and existing estimators, particularly for large data sets.
Simulation study and real-life data sets:
● The study covered practical stratification situations and included a variety of artificial populations with varying probability distributions, including exponential, uniform, and gamma. Comparing the results on variables from practical problems, such as industrial activity and employment levels, provided valuable insight into practical application, while the large number of replications supported statistical robustness.
● According to the findings, the suggested estimators consistently performed better than conventional techniques in terms of PRE, with appreciable gains over a range of sample sizes and distributions. The investigation was made more detailed by the use of statistical tests to validate the significance of the observed efficiency gains. This thorough study established the advantages of the novel methodology and supports its adoption by demonstrating its superiority over other methods.
Moreover, we investigated the characteristics of the suggested efficient class of estimators under a two-phase stratified sampling technique. It is also conceivable to propose novel estimators under non-response, and our findings can be useful in identifying the more efficient estimators with the lowest MSEs. This is an appropriate topic for future investigation.
Conflict of interest
The author declares no conflict of interest.