1. Introduction
Auxiliary (supplementary) information is widely used in survey sampling to increase the precision of estimators. To enhance relative efficiency, ratio, regression, and product-type estimators use not only the data on the primary study variable but also supplementary information from one or more auxiliary variables. Extensive research has been devoted to developing improved estimators of population parameters such as the mean, total, median, and other measures. For more details about improved estimators and their properties, see [1,2,3] and references therein.
Sampling theory uses auxiliary variables alongside the study variable to obtain a better design and to improve the cost-effectiveness of estimators by exploiting the relationship between the variables. When the population mean of the auxiliary variable is unavailable prior to conducting a survey, the two-phase sampling method is preferred. Survey methods that select samples from a population in two distinct stages are called two-phase sampling methods. When the main phase is costly, a preliminary phase can be conducted more cheaply beforehand. The basic concept of two-phase sampling was initially introduced by [4]. Recently, it has attracted considerable interest because of its cost-effectiveness. Investigations of two-phase sampling include [5,6,7,8].
The information obtained through a sample survey can produce unexpected results. When the sample contains extreme values, the mean estimator becomes susceptible to distortion, which may lead to biased results. Using the extreme values of known auxiliary variables, [9] first provided two estimators based on linear transformations. After this initial work, [10] suggested improved ratio, product, and regression estimators that use extreme values to estimate the population mean. To estimate the unknown population mean more accurately in the presence of extreme values, [11] introduced some estimators based on a two-phase strategy. A new set of transformations based on extreme values of auxiliary variables was provided by [12], which enhanced the estimation of the unknown population mean. Subsequently, [13] improved the precision of population mean estimation by introducing extreme values into a stratified random sampling technique. Using extreme value ideas as a foundation, [14,15] obtained a family of estimators intended to minimize the mean squared error by using the population variance. More recently, [16,17,18] proposed various classes of efficient estimators of the population variance that employ transformations based on extreme values. A ratio estimator of the population mean using auxiliary information under the optimal sampling design was proposed by [19]. Additional insights and related methodologies can be found in [20,21,22].
1.1. Motivation and innovation
The motivation behind this study lies in the ongoing challenges of improving the accuracy and efficiency of population mean estimation, particularly in survey sampling. Traditional estimators often struggle with the presence of extreme values, such as the smallest and largest observations, which can distort the results. This issue is particularly critical in fields like economics, healthcare, and public health, where precise population estimates directly influence policy decisions and resource allocation.
Sometimes, there is a temptation to remove extreme observations from datasets. However, the accuracy of estimators, as measured by the mean squared error (MSE), often declines in scenarios involving extreme values. The innovation of this work lies in the development of new estimators designed to improve the precision of population mean estimation while reducing the MSE compared with existing methods. In this article, we introduce improved estimators that employ both the smallest and largest observations of the independent variable and their ranks in a double-phase sampling design. Some implications of the suggested estimators are outlined in Section 1.2.
1.2. Applications of the proposed estimators across domains
The new estimators are widely applicable, with potential uses in areas like economics, healthcare, manufacturing, retail, and transportation. They are particularly effective in scenarios demanding precise predictions.
1) Industry: In the industrial sector, our method supports predictive maintenance by monitoring equipment in manufacturing. For instance, in an automotive plant, it analyzes sensor data to forecast component failures, enabling timely repairs that reduce downtime and costs. This improves efficiency, lowers expenses, and ensures smoother production.
2) Economics: The estimators significantly improve economic forecasting by providing more accurate predictions of indicators like inflation, unemployment, and gross domestic product (GDP) growth. This makes it possible for the public and commercial sectors to allocate resources more effectively and make better policy decisions.
3) Environmental science: The proposed estimators are important in environmental science, assisting in environmental forecasting and estimating pollution. They provide accurate predictions of ecological patterns, aiding in the projection of atmospheric shifts, pollution concentrations, and their effects on ecosystems.
4) Transportation: Through traffic pattern analysis and delay prediction, our approach improves route optimization for delivery services in the transportation industry. As a result, delivery times and fuel consumption can be reduced by logistics businesses planning more effective routes.
5) Public health: In public health, the estimators enhance epidemiological modeling and health outcome predictions. They enable more precise forecasting of disease spread, healthcare demands, and intervention planning. Organizations are better able to allocate resources and get ready for future healthcare demands due to the more accurate information.
6) Retail: Our approach more precisely predicts product demand in retail, which enhances inventory management. This enables retailers to keep inventory levels balanced, preventing both excess stock and shortages. These examples show how adaptable our methodology is and how it has the ability to significantly advance both industry and healthcare.
7) Medicine: Our approach can enhance early cancer detection by improving the accuracy of medical imaging analysis. When integrated with tools like magnetic resonance imaging (MRI) or computed tomography (CT) scans, it aids in identifying malignant tissues more effectively, enabling timely intervention. For example, it could improve tumor detection in breast cancer through better image segmentation and classification of abnormal tissue.
The structure of the paper is arranged as follows. The basic notations and concepts utilized in this study are explained in Section 2. Section 3 provides a summary of various existing estimators. The proposed classes of estimators are thoroughly discussed in Section 4. There is a rigorous mathematical comparison in Section 5 between existing and new proposed estimators. A description of a simulation conducted to produce some populations from various distributions is presented in Section 6, which is intended to evaluate the theoretical conclusions from Section 5. This section also presents numerical examples that illustrate the practical implications of the theoretical findings. Finally, Section 7 summarizes the overall conclusions and suggests some ideas for new research.
2. Framework for methodology and notation
A finite population can be represented as Ω=(Ω1,Ω2,Ω3,…,ΩN), where N denotes the total number of units. In this context, let xi represent the value of the independent variable X, ri indicate the ranks associated with the independent variable R, and yi denote the value of the response variable Y for the ith unit.
This research explores the impact of the variable X and presents two new estimators developed to estimate the population mean ˉY. The two-phase sampling scheme is defined as follows.
(1) In the first phase, a sample of size m1 is drawn without replacement in order to estimate the population mean ˉX. The sample size m1 must be smaller than N.
(2) In the second phase, a sample of m2 observations (where m2<m1) is drawn without replacement from the first-phase sample to measure the variables y and x.
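The two sampling phases described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `two_phase_srswor`, the seed, and the sizes N = 1000, m1 = 200, m2 = 50 are hypothetical choices made only for the example.

```python
import numpy as np

def two_phase_srswor(N, m1, m2, rng):
    """Draw first- and second-phase samples by simple random sampling
    without replacement (SRSWOR). Returns two arrays of unit indices."""
    if not (0 < m2 < m1 < N):
        raise ValueError("sample sizes must satisfy 0 < m2 < m1 < N")
    # Phase 1: m1 units from the N population units (x is observed on these).
    first = rng.choice(N, size=m1, replace=False)
    # Phase 2: m2 units from the first-phase sample (y and x are observed).
    second = rng.choice(first, size=m2, replace=False)
    return first, second

rng = np.random.default_rng(0)  # hypothetical seed
first, second = two_phase_srswor(N=1000, m1=200, m2=50, rng=rng)
```

Because the second-phase units are drawn from the first-phase indices, every second-phase unit also belongs to the first-phase sample, which is what the condition m2<m1 presupposes.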
The following are the formulas for the population means, which are calculated as
and
Under sampling without replacement, let the population variances for the variables (X,R,Y) be defined as follows:
while the population coefficients of variations for these variables are defined as:
and
respectively. Furthermore, the relationships among the variables (Y,X), (Y,R), and (X,R) are described through their population correlation coefficients as follows:
and
where
and
are the population co-variances, respectively.
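As a rough computational companion to the notation in this section, the sketch below computes population means, variances, coefficients of variation, and correlation coefficients for (X, R, Y). It assumes the usual finite-population conventions (variances and covariances with divisor N−1, and ranks 1,…,N assigned to X in ascending order); these conventions and the helper name `population_summary` are assumptions for illustration and are not taken from the paper's formulas.

```python
import numpy as np

def population_summary(x, y):
    """Means, variances, CVs, and correlations for (X, R, Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Ranks 1..N of x; ties are broken by order of appearance (assumption).
    r = np.empty(n)
    r[np.argsort(x, kind="stable")] = np.arange(1, n + 1)
    data = {"X": x, "R": r, "Y": y}
    mean = {k: v.mean() for k, v in data.items()}
    var = {k: v.var(ddof=1) for k, v in data.items()}  # divisor N - 1 (assumption)
    cv = {k: np.sqrt(var[k]) / mean[k] for k in data}
    corr = {
        "YX": np.corrcoef(y, x)[0, 1],
        "YR": np.corrcoef(y, r)[0, 1],
        "XR": np.corrcoef(x, r)[0, 1],
    }
    return {"mean": mean, "var": var, "cv": cv, "corr": corr}

# Tiny made-up population, used only to exercise the function.
summary = population_summary([2, 4, 6, 8], [1, 3, 5, 7])
```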
Now, we define the first-phase sample means and variances based on m1 observations associated with the variables X and R, which are presented in the following manner:
and
To calculate the sample means and variances for the second phase, a random sample without replacement of size m2 observations is chosen based on the first phase (m2<m1), which are defined as:
and
3. Existing estimators
The analysis of the mean squared errors and biases associated with the estimators used to determine the population mean is given in this section. These findings are then compared with those of the proposed classes of estimators to identify areas for improvement.
The conventional unbiased estimator is presented below:
The variance of the conventional unbiased estimator ˉyd is defined by:
where
is the sampling correction term for the second phase.
The following is an expression for a ratio-type estimator ˉydR according to [4]:
The following formulas are used to express the bias and MSE of ˉydR:
where
and
represent the sampling correction terms for the first and the intermediate stages.
The product estimator ˉydP is expressed as:
The following formulas are used to express the bias and MSE of ˉydP:
and
The standard regression estimator is denoted as ˉydlr, which is generally represented by:
where byx denotes the regression coefficient from the sample.
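For orientation, the classical double-sampling estimators named in this section can be sketched as follows. The formulas used are the common textbook forms (sample mean, ratio, product, and linear regression with the first-phase mean of x standing in for the unknown ˉX); they are assumed here and may differ in detail from the expressions in the cited works. The function name and the toy data are hypothetical.

```python
import numpy as np

def classical_two_phase_estimators(x1, x2, y2):
    """Textbook two-phase estimators of the population mean of Y.

    x1 : auxiliary values in the first-phase sample (size m1)
    x2 : auxiliary values in the second-phase sample (size m2)
    y2 : study-variable values in the second-phase sample (size m2)
    """
    x1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, x2, y2))
    xbar1, xbar2, ybar2 = x1.mean(), x2.mean(), y2.mean()
    # Sample regression coefficient of y on x from the second-phase sample.
    b_yx = np.cov(x2, y2, ddof=1)[0, 1] / np.var(x2, ddof=1)
    return {
        "mean": ybar2,
        "ratio": ybar2 * xbar1 / xbar2,
        "product": ybar2 * xbar2 / xbar1,
        "regression": ybar2 + b_yx * (xbar1 - xbar2),
    }

# Made-up data: y is exactly 2x, so ratio and regression adjust identically.
est = classical_two_phase_estimators(
    x1=[1, 2, 3, 4, 5, 6], x2=[2, 4, 6], y2=[4, 8, 12]
)
```

In this toy case the second-phase mean of x (4) exceeds the first-phase mean (3.5), so the ratio and regression estimators pull the estimate below the plain sample mean, while the product estimator moves it in the opposite direction.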
The following formulas are used to express the bias and MSE of ˉydlr:
where
and
According to [23], the definitions of the exponential ratio and product type estimators are given by
and
The expressions for MSE of ˉydRMD and ˉydPMD are, respectively, expressed as follows:
and
where
The authors of [24] proposed a double sampling estimator, which is defined as
where
The following formulas are used to express the bias and MSE of ˉydRP:
and
4. Proposed generalized estimators
This section, inspired by the methodologies discussed in [25,26,27], presents some improved classes of estimators. These improved estimators employ both the smallest and largest observations of the independent variables and their ranks in a double-phase design to improve estimation precision. The proposed estimators are defined below:
and
where the scalar quantities (α1,α2) can assume the values (0,−1,1). It is important to determine the unknown constant values of (k1,k2) by using the scalar quantities (α1,α2) so that the biases and mean squared errors can be minimized. Meanwhile, b1=XM−Xm represents the difference between extreme observations of the independent variable, and b2=RM−Rm represents the difference between the highest and lowest ranks of the independent variable. In addition, a1 and a2 are the various transformation parameter values. A detailed description of the subsets of the proposed estimator-Ⅰ can be found in Table 1.
where
4.1. Properties of the improved estimator-Ⅰ
To determine the mathematical properties of different estimators, the relative error terms are utilized:
such that E(ei)=0.
Moreover,
In order to investigate the characteristics of the first improved estimator, we rewrite (4.1) using the error terms:
where
By expanding the right-hand side of (4.3) in a Taylor series and discarding terms in ei of degree higher than two, we get the following expression:
Using (4.4), the bias of ˉydU is given by
where
and
By applying expectation after squaring both sides of (4.4), we derive the corresponding equation that expresses the MSE of ˉydU.
where
and
By minimizing Eq (4.6), the optimal values for k1 and k2 can be determined, as shown below
and
The minimum values for bias and MSE of ˉydU are determined by replacing the optimum k1 and k2 into (4.5) and (4.6). The resulting expressions are given below:
and
4.2. Properties of the improved estimator-Ⅱ
To assess the behavior of the proposed estimator, we reformulate (4.2) using relative errors, which allows the computation of the Eqs (4.13) and (4.14), i.e.,
where
and
By expanding the right-hand side of (4.9) in a Taylor series and discarding terms in ei of degree higher than two, we get the following expression:
Using (4.10), the bias of ˉyde is given by:
By applying expectation after squaring both sides of (4.10), we derive the corresponding equation that expresses the MSE of ˉyde.
To express the bias and MSE for ˉyde, we substitute the known constants k3 and k4 into the formulas in (4.11) and (4.12). After simplifying the expressions, we get the following results:
and
5. Mathematical comparison
In this section, we derive efficiency conditions by comparing the mean squared errors of the proposed estimators with those of the existing estimators ˉyd, ˉydR, ˉydlr, ˉydRMD, ˉydPMD, and ˉydRP.
5.1. Improved estimator-Ⅰ
Condition (ⅰ): We derived the following results from (3.2) and (4.8) to ensure that the suggested estimator is at least as efficient:
Condition (ⅱ): The following expression obtained from (3.5) and (4.8), provides the required condition for the proposed estimator-Ⅰ to be more efficient:
Condition (ⅲ): The following condition, obtained from (3.11) and (4.8), ensures that the proposed estimator-Ⅰ has a smaller mean squared error:
Condition (ⅳ): The required condition for the proposed estimator-Ⅰ, determined from (3.14) and (4.8), shows that the proposed estimator-Ⅰ is more efficient:
Condition (ⅴ): The resulting expression derived from (3.15) and (4.8) defines the necessary condition for the proposed estimator-Ⅰ to show better efficiency:
Condition (ⅵ): The following inequality, derived from (3.18) and (4.8), indicates the effectiveness of the proposed estimator-Ⅰ:
5.2. Improved estimator-Ⅱ
Condition (ⅶ): From (3.2) and (4.14), we derive:
If θ>θ′, the following inequality is satisfied:
In a similar way, when θ<θ′, the following inequality holds true:
The proposed estimator ˉyde shows better performance relative to MSE(ˉyd) when either (5.1) or (5.2) holds.
Condition (ⅷ): From (3.5) and (4.14), we derive:
If θ>θ′, the following inequality is satisfied:
In a similar way, when θ<θ′, the following inequality holds true:
The proposed estimator ˉyde shows better performance relative to MSE(ˉydR) when either (5.3) or (5.4) holds.
Condition (ⅸ): From (3.11) and (4.14), we derive
If θ>θ′, the following inequality is satisfied:
In a similar way, when θ<θ′, the following inequality holds true:
The proposed estimator ˉyde shows better performance relative to MSE(ˉydlr) when either (5.5) or (5.6) holds.
Condition (ⅹ): From (3.14) and (4.14), we derive
If θ>θ′, the following inequality is satisfied:
In a similar way, when θ<θ′, the following inequality holds true:
The proposed estimator ˉyde shows better performance relative to MSE(ˉydRMD) when either (5.7) or (5.8) holds.
Condition (xi): From (3.15) and (4.14), we derive
If θ>θ′, the following inequality is satisfied:
In a similar way, when θ<θ′, the following inequality holds true:
The proposed estimator ˉyde shows better performance relative to MSE(ˉydPMD) when either (5.9) or (5.10) holds.
Condition (xii): From (3.18) and (4.14), we derive:
If θ>θ′, the following inequality is satisfied:
In a similar way, when θ<θ′, the following inequality holds true:
The proposed estimator ˉyde shows better performance relative to MSE(ˉydRP) when either (5.11) or (5.12) holds.
6. Results of the numerical comparison
In this section, we compare the mean squared errors of the improved classes of estimators with those of existing estimators in order to provide a comprehensive analysis. This analysis relies on both simulated data and distinct real datasets. By analyzing the MSE across these simulated and different datasets, we aim to provide a thorough assessment of the performance and reliability of the new proposed estimators.
6.1. Simulation study
The simulation method from [28] is used in this section, where the auxiliary variable X is generated for six populations, each associated with a different probability distribution.
● Data-1: X∼Gamma(η1=4,η2=6),
● Data-2: X∼Gamma(η1=8,η2=10),
● Data-3: X∼Uniform(ν1=3,ν2=5),
● Data-4: X∼Uniform(ν1=0,ν2=1),
● Data-5: X∼Exponential(μ=5),
● Data-6: X∼Exponential(μ=10).
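The six auxiliary populations listed above could be generated along the following lines. The parameterization is an assumption: the gamma parameters (η1, η2) are read as (shape, scale) and the exponential parameter μ as the mean, which the text does not state explicitly. The population size of 1000 follows Step 1 of the simulation procedure; the function name and seed are hypothetical.

```python
import numpy as np

def generate_x_populations(N=1000, seed=0):
    """Generate the six auxiliary-variable populations (Data-1 to Data-6)."""
    rng = np.random.default_rng(seed)
    return {
        # Gamma(eta1, eta2) read as shape=eta1, scale=eta2 (assumption).
        "Data-1": rng.gamma(shape=4, scale=6, size=N),
        "Data-2": rng.gamma(shape=8, scale=10, size=N),
        # Uniform(nu1, nu2) on the interval [nu1, nu2).
        "Data-3": rng.uniform(3, 5, size=N),
        "Data-4": rng.uniform(0, 1, size=N),
        # Exponential(mu) read as mean mu, i.e. scale=mu (assumption).
        "Data-5": rng.exponential(scale=5, size=N),
        "Data-6": rng.exponential(scale=10, size=N),
    }

populations = generate_x_populations()
```

If the source intends a rate parameterization instead, `scale` would be replaced by its reciprocal; the rest of the sketch is unchanged.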
Afterwards, by utilizing the correlation coefficient's fixed value
and the random error term
the dependent variable Y is computed by using the following equation:
On the basis of each distribution and correlation setting, we examined the MSEs of various suggested estimators and existing estimators to assess the robustness and efficiency.
Step 1: We generate a population of 1000 observations from each of the probability distributions mentioned above.
Step 2: In the first phase, simple random sampling without replacement (SRSWOR) is used to choose a sub-sample of size m1 from the whole population of size N.
Step 3: From the first-phase sample, the SRSWOR technique is again applied to draw a second-phase sub-sample of size m2.
Step 4: The population parameters, as well as the extreme observations of the independent variable and its ranks, are calculated from the population generated above. We also determine the optimal values of the unknown constants for the improved classes of estimators.
Step 5: Samples of different sizes are drawn from each population using the SRSWOR method.
Step 6: For each sample size under consideration, the MSE is calculated for all the estimators examined in this paper.
Step 7: After 45,000 repetitions of Steps 5 and 6, the MSE values for all the estimators are calculated on the basis of the formula defined as follows:
Step 8: Finally, Table 2 reports the MSE results for artificial populations.
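The procedure in Steps 1 to 7 can be sketched as a small Monte Carlo loop. This is an illustration under stated assumptions, not a reproduction of the study: the empirical MSE is taken as the average squared deviation of the estimate from the true population mean, the response is built as Y = ρX + e with ρ = 0.8 and standard normal e (a hypothetical construction), only the sample mean and the textbook ratio estimator are compared, and 2000 replications are used instead of 45,000 to keep the run short.

```python
import numpy as np

def empirical_mse(x, y, m1, m2, estimator, reps, rng):
    """Average squared error of a two-phase estimator over repeated samples."""
    N = len(y)
    ybar_pop = y.mean()
    sq_err = np.empty(reps)
    for k in range(reps):
        first = rng.choice(N, size=m1, replace=False)       # phase 1 (SRSWOR)
        second = rng.choice(first, size=m2, replace=False)  # phase 2 (SRSWOR)
        est = estimator(x[first], x[second], y[second])
        sq_err[k] = (est - ybar_pop) ** 2
    return sq_err.mean()

def sample_mean(x1, x2, y2):
    return y2.mean()

def ratio_estimator(x1, x2, y2):
    # Textbook two-phase ratio form (assumed, see lead-in).
    return y2.mean() * x1.mean() / x2.mean()

rng = np.random.default_rng(1)                  # hypothetical seed
x = rng.gamma(shape=4, scale=6, size=1000)      # Data-1 style population
y = 0.8 * x + rng.normal(0.0, 1.0, size=1000)   # assumed model for Y

mse_mean = empirical_mse(x, y, 200, 50, sample_mean, 2000, rng)
mse_ratio = empirical_mse(x, y, 200, 50, ratio_estimator, 2000, rng)
```

Any other estimator with the same `(x1, x2, y2)` signature can be passed to `empirical_mse`, so the proposed estimators could be compared in the same loop once their formulas are coded.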
6.2. Numerical examples
We assessed the performance of different estimators by calculating the mean squared errors across three separate datasets. The aim of this analysis was to check the accuracy of the proposed estimators. Detailed descriptions of the datasets, along with their summary statistics, are provided below.
Dataset 1. (Source: [29], p. 226)
Y: Information on the number of employees in each department for the year 2012, which indicates the size of the workforce across different industries.
X: The number of factories officially authorized in each department in 2012, which provides information on the productivity of the department and the presence of industry.
R: The departments are arranged according to the count of factories in 2012, used to compare relative industrial activity.
Dataset 2. (Source: [29], p. 135)
Y: Denotes the overall count of students enrolled in educational departments in 2012;
X: Denotes the total number of schools funded by the government in 2012;
R: Denotes the ranks of the total number of schools funded by the government in 2012 according to the number of schools.
Dataset 3. (Source: [1], p. 24)
Y: The food costs that the families paid for, which are directly related to their jobs;
X: The total weekly income that the families made, which shows their financial resources for that time;
R: The ranking of families according to their weekly income, which shows how much money they made compared with each other.
Summary statistics for the above datasets are given in Table 3, and the mean squared error comparison between the proposed and existing estimators is displayed in Table 4.
6.3. Discussion
To identify the performance and quality of the newly improved estimators, we conducted simulations and analyzed three real-life datasets. The performance comparison was made using the MSE as the key criterion. The MSE values for both the new and existing estimators are listed in Table 2, while the results for the real datasets are shown in Table 4. From these evaluations, the following conclusions can be drawn.
● Findings from both the simulations and the real datasets indicate that the newly introduced estimators consistently yield lower MSE values than the existing approaches discussed in the literature. This pattern is clearly reflected in Tables 2 and 4.
● Additionally, the MSE values for the proposed estimators are consistently lower than those for the existing ones, as illustrated by the declining trend of the graph lines in Figures 1 and 3, both for the simulation experiments and real-life data. This observation suggests that the proposed estimators outperform the other ones, with the MSE values for the new estimators showing an inverse pattern compared with those of the existing estimators.
● Figure 2 presents a grouped bar chart comparing the MSE of the proposed estimator across various artificial populations generated from gamma, uniform, and exponential distributions. Each group on the x-axis represents a distribution type, while the two bars within each group correspond to different parameter settings for that distribution (e.g., Gam(4,6) and Gam(8,10) for the gamma distribution). The y-axis shows the MSE values, with lower bars indicating better performance for that estimator. This visualization helps highlight how the estimator performs under different distributional shapes and scales, offering a clear comparison of its efficiency across varied population structures.
● Figure 4 shows a grouped bar chart of the MSE values for 15 estimators across three real datasets. The x-axis indexes the estimators from 1 to 15, while the y-axis uses a logarithmic scale to handle the large variation in MSE values, particularly the contrast between the relatively large MSE values of Dataset 1 and Dataset 2 and the smaller values from Dataset 3. Each group of bars represents an estimator, with different colors indicating the datasets. This visualization allows for easy comparison of estimator performance across datasets, highlighting how some estimators perform consistently better, especially on Dataset 3. From this plot, it is evident that certain estimators (such as Estimators 14 and 15) consistently yield lower MSE values.
7. Conclusions
This research introduced new classes of efficient estimators for the population mean that use the smallest and largest values of the independent variable along with its known highest and lowest rank values. Section 5 gives theoretical conditions under which the proposed estimators are more efficient than existing methods. To support these findings, simulation experiments were performed, along with an analysis of real-life datasets. The results show that the new estimators consistently perform better than the existing estimators, having a lower mean squared error (MSE), as displayed in Table 2. These findings are also supported by the results in Table 4, which validate the theoretical conditions given in Section 5.
The simulation results and empirical studies clearly show that the suggested estimators ˉydi (i=e,U1,U2,…,U8) are more efficient than the other estimators being considered. Among all the proposed estimators, ˉydU8 proves to be the most efficient, making it strongly recommended.
This study introduced an enhanced method for population mean estimation by utilizing the smallest and largest values of the independent variables along with their ranks. The main advantage of this approach lies in its ability to significantly reduce the MSE, particularly in the presence of extreme values, offering improved precision over traditional methods. This method is valuable in fields such as economics, healthcare, public policy, and education, where accurate population estimates are essential. However, the method has some limitations. Its performance may vary, depending on the characteristics of the data, such as distribution and correlation, and the two-phase sampling design may add complexity and cost.
Moreover, our study examined the properties and efficiency of the newly enhanced estimators under a two-phase sampling technique. This approach allowed us to compare the performance of the enhanced estimators against traditional methods more effectively. Future research may explore the application of these methods in other sampling designs, including stratified and systematic approaches, and investigate optimal sample size allocation strategies to further improve efficiency. The use of additional auxiliary variables and assessment on large-scale datasets may provide deeper insights. Combining these estimators with machine learning for adaptive sampling or real-time data processing may enhance estimation precision in dynamic environments. This topic offers an interesting direction for further investigation.
Author contributions
Hleil Alrweili: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing-original draft preparation, writing-review and editing, visualization, project administration; Fatimah A. Almulhim: Conceptualization, methodology, validation, formal analysis, investigation, data curation, writing-original draft preparation, writing-review and editing, visualization, supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.
Use of Generative-AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This project was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project. The authors, therefore, acknowledge with thanks Princess Nourah bint Abdulrahman University Researchers Supporting funds, Project number (PNURSP2025R515), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Conflict of interest
The authors declare no conflict of interest.