
In this work, we suggest a reduced distribution with two parameters of the modified Weibull distribution to avoid some estimation difficulties. The hazard rate function of the reduced distribution exhibits decreasing, increasing or bathtub shape. The suggested reduced distribution can be applied to many problems of modelling lifetime data. Some statistical properties of the proposed distribution have been discussed. The maximum likelihood is employed to estimate the model parameters. The Fisher information matrix is derived and then applied to construct confidence intervals for parameters. A simulation is conducted to illustrate the performance of maximum likelihood estimation. Four sets of real data are tested to prove the proposed distribution advantages. According to the statistical criteria, the proposed distribution fits the tested data better than some well-known two-and three-parameter distributions.
Citation: M. G. M. Ghazal, H. M. M. Radwan. A reduced distribution of the modified Weibull distribution and its applications to medical and engineering data[J]. Mathematical Biosciences and Engineering, 2022, 19(12): 13193-13213. doi: 10.3934/mbe.2022617
[1] | Tingxi Wen, Hanxiao Wu, Yu Du, Chuanbo Huang . Faster R-CNN with improved anchor box for cell recognition. Mathematical Biosciences and Engineering, 2020, 17(6): 7772-7786. doi: 10.3934/mbe.2020395 |
[2] | Wei-Min Zheng, Qing-Wei Chai, Jie Zhang, Xingsi Xue . Ternary compound ontology matching for cognitive green computing. Mathematical Biosciences and Engineering, 2021, 18(4): 4860-4870. doi: 10.3934/mbe.2021247 |
[3] | Carlos Castillo-Chavez, Bingtuan Li, Haiyan Wang . Some recent developments on linear determinacy. Mathematical Biosciences and Engineering, 2013, 10(5&6): 1419-1436. doi: 10.3934/mbe.2013.10.1419 |
[4] | Sanket Desai, Nasser R Sabar, Rabei Alhadad, Abdun Mahmood, Naveen Chilamkurti . Mitigating consumer privacy breach in smart grid using obfuscation-based generative adversarial network. Mathematical Biosciences and Engineering, 2022, 19(4): 3350-3368. doi: 10.3934/mbe.2022155 |
[5] | Hasan Zulfiqar, Rida Sarwar Khan, Farwa Hassan, Kyle Hippe, Cassandra Hunt, Hui Ding, Xiao-Ming Song, Renzhi Cao . Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. Mathematical Biosciences and Engineering, 2021, 18(4): 3348-3363. doi: 10.3934/mbe.2021167 |
[6] | Bochuan Du, Pu Tian . Factorization in molecular modeling and belief propagation algorithms. Mathematical Biosciences and Engineering, 2023, 20(12): 21147-21162. doi: 10.3934/mbe.2023935 |
[7] | Liang Yu, Zhengkuan Zhang, Yangbing Lai, Yang Zhao, Fu Mo . Edge computing-based intelligent monitoring system for manhole cover. Mathematical Biosciences and Engineering, 2023, 20(10): 18792-18819. doi: 10.3934/mbe.2023833 |
[8] | Ruiping Yuan, Jiangtao Dou, Juntao Li, Wei Wang, Yingfan Jiang . Multi-robot task allocation in e-commerce RMFS based on deep reinforcement learning. Mathematical Biosciences and Engineering, 2023, 20(2): 1903-1918. doi: 10.3934/mbe.2023087 |
[9] | Michele La Rocca, Cira Perna . Nonlinear autoregressive sieve bootstrap based on extreme learning machines. Mathematical Biosciences and Engineering, 2020, 17(1): 636-653. doi: 10.3934/mbe.2020033 |
[10] | Qi Cui, Ruohan Meng, Zhili Zhou, Xingming Sun, Kaiwen Zhu . An anti-forensic scheme on computer graphic images and natural images using generative adversarial networks. Mathematical Biosciences and Engineering, 2019, 16(5): 4923-4935. doi: 10.3934/mbe.2019248 |
In this work, we suggest a reduced distribution with two parameters of the modified Weibull distribution to avoid some estimation difficulties. The hazard rate function of the reduced distribution exhibits decreasing, increasing or bathtub shape. The suggested reduced distribution can be applied to many problems of modelling lifetime data. Some statistical properties of the proposed distribution have been discussed. The maximum likelihood is employed to estimate the model parameters. The Fisher information matrix is derived and then applied to construct confidence intervals for parameters. A simulation is conducted to illustrate the performance of maximum likelihood estimation. Four sets of real data are tested to prove the proposed distribution advantages. According to the statistical criteria, the proposed distribution fits the tested data better than some well-known two-and three-parameter distributions.
With the development of society and the improvement of computer storage and data processing ability, the explosive growth of data has become an important feature. Faced with the growing scale of massive data, the storage and management of massive data has been a concern. In order to improve the reliability and access efficiency of HDFS, replica technology is commonly used to copy multiple data files and store them on multiple nodes of the distributed file system. For files of different popularity in distributed system, how to dynamically adjust the number of copies so it can not only ensure the reliability and availability of data, but also improve the data access efficiency? The system should be able to adjust the number of copies when the predicted access popularity value is changed, with the number of copies increased for high popularity files and the number of copies reduced for low popularity files, so as to improve the access efficiency and save storage resources. The solution of the above problems is of great significance in improving performance of the cluster.
File popularity is used to describe the degree to which files are required by users in a time period, and it is defined by the frequency of file access in a certain period of time [1], i.e., the times the file is accessed by the user in unit time. Domestic and foreign scholars mainly use neural network models [2,3], grey models [4], deep learning models and so on in file popularity prediction. In References [5,6], aiming at the shortcomings of static copy strategy, a prediction model based on grey model is proposed to predict the future access popularity of data according to the latest access characteristics of files. The frequency of file access is regarded as the popularity of file access. According to temporal locality, the recently accessed files will be accessed again in a short time, and the algorithm calculated file's accessing temperature by analyzing file's accessing frequency within the specified time, and improved the performance of data service [7,8]. Literature [9,10] used combination forecasting model to deal with the shortcomings of single forecasting model. The neural network model in Reference [11] needs to set up accurate levels and a large number of sample space for training, which will affect the time of file popularity prediction. A method was proposed based on LSTM deep learning model to predict the popularity of files [12], which divides the dynamic time window to construct a time series of file access features, and predicts the access trends of different data. With a subjective and objective popularity calculation method, Reference [13] dynamically predicted the file access volume through LSTM neural network model, calculated the file popularity through the predicted file access volume, and adjusted the copy storage strategy.
In the aspect of replica dynamic management, Veeravalli et al. [14] proposed a CDRM dynamic copy management model, which adjusts the number and storage location of data copies through dynamic changes in node storage space capacity and load conditions. Aiming at the low efficiency of randomly selecting the source node and target node for replication in the default copy creation process of HDFS in Hadoop, Higai et al. [15] proposed a copy management strategy based on a single ring structure. The copy management strategy of this method is 25% more efficient than the default method.
In the current methods of file popularity prediction, the BP neural network prediction tends to get stuck on local optimum, and the accuracy of grey prediction model is not enough. The time feedback model and the prediction method with decay factor are suitable for short-term prediction, while the prediction accuracy is not high. The deep learning LSTM requires a long training time. In the light of the above problems, aiming at the static copy management strategy and the default copy placement strategy, the paper created the unbiased grey model to predict the future access popularity of the file, and proposed an optimization algorithm based on the cuckoo search which can optimize the Markov model to correct the grey prediction value, then the number of copies is adjusted by the popularity prediction value. The main innovations of this paper are listed as follows:
(1) This paper builds a prediction model of file popularity based on the unbiased grey model. This model has the sequence and uncertain character, and has high prediction accuracy.
(2) This paper presents an optimized algorithm. The Markov model is optimized by cuckoo search algorithm to correct the predicted values and improve the accuracy of prediction.
(3) This paper presents a new method to compute the number of copies. This method dynamically adjusts the number of copies and improves the access efficiency.
The remainder of this paper is organized as follows. Section 2 introduces file access popularity prediction method. In Section 3, we introduce how to calculate the number of copies. Section 4 is about the determination of the number of file copies. Section 5 reports experimental results, and conclusions of this paper are made in the last section.
In order to deal with the problem of low access efficiency caused by the sudden increase of hot file access, the method of predicting the popularity of the file at the next moment in advance is adopted to adjust the number of copies in time to improve user access efficiency [16,17]. In the past research, a single method model generally has limitations and doesn't work well in addressing certain problems. Many researchers use the combined model to cope with the limitations of the single model and achieve better results [18]. In our research, the grey model is adopted which can predict the data development trend through a small amount of simple calculation and incomplete information, and predicts the file popularity at the next moment through the latest access characteristics of the file [19]. In order to further improve the accuracy of the prediction data, the cuckoo search algorithm is used to optimize the Markov model and modify the prediction data to improve the prediction accuracy.
The grey model is suitable for the prediction model of the system with uncertain factors. For the grey system with known and unknown information, the grey differential prediction model is established by using a small amount of incomplete information to predict the development and change of the system behavior eigenvalue and the grey process related to time series which changes in a certain range. Grey process finds the law of system change through correlation analysis, establishes the corresponding differential equation model, and predicts the future development trend. The grey prediction model uses equal time interval to obtain a series of quantitative values reflecting the characteristics of the prediction object to construct the grey model prediction data set. This paper takes the file access popularity as the original data, and predicts the future development trend of the file popularity through the grey model processing.
Markov model is a stochastic process of things discovered by Russian mathematicians. The change process of this kind of things is only related to the recent state. Markov model can predict the future development trend according to the probability of transition between different states. In the process of state transition, the state after the nth transition only depends on the previous result state, which has nothing to do with the original state of the system and the Markov process before the transition. In this paper, unbiased grey model is used to predict the file popularity. Because the randomness of user access and the sudden change of file popularity are two inevitable factors, the grey prediction model may produce large errors. Markov model is used to predict the state of file popularity in the future, and the error of file popularity prediction value is corrected.
Cuckoo search (CS) algorithm [20,21] is a natural heuristic algorithm proposed by Yang and Deb. It has the advantages of fast convergence, high efficiency and few adjustment parameters. This method is based on the cuckoo's parasitic brooding behavior and Lévy flight, which is more effective than genetic algorithm and particle swarm optimization. Lévy flight can expand the search range, increase the diversity of the population, and it is easier to jump out of the local point. The algorithm regards all host nests as a generation, and each nest carries bird eggs as a solution. The goal is to find the potential optimal solution through continuous iteration to replace the existing scheme.
This paper uses the cuckoo search algorithm for multi-objective optimization, based on the following three criteria:
(1) Each cuckoo lays K eggs at a time (representing the solution of K goals) and puts them in a randomly selected host nest;
(2) The best nest with high quality eggs (solutions) in each generation is brought to the next generation;
(3) The number of available host nests is fixed, and the probability of finding cuckoo eggs is pa∈(0,1). After discovery, the host can destroy the eggs or give up the old nest to build a new one.
Each egg in the nest represents a solution, and each cuckoo can lay only one egg. The purpose is to use a new or better solution to replace the bad one in the nest. The CS search algorithm is used to explore the relationship between the Markov state interval, the actual value and the predicted value to find the most suitable value of the interval and to modify the predicted value. The problem is a multi-objective optimization problem, and the optimal solution is used to divide the Markov model state interval.
The grey Markov model is widely used in the prediction of small sample and exponential distribution sample data. In a period of time, the access frequency of popularity files is like this. Therefore, the paper establishes the file popularity prediction model from grey model parameters, Markov correction value and prediction sequence update, and finally optimizes the model.
The unbiased grey model is used to process the file access sequence and predict the file popularity in the future. With the help of grey development coefficient α and grey action quantity U of grey model, the original data series fitting model of unbiased grey prediction model is determined, and the future popularity of documents is predicted. The original data sequence of unbiased grey model is the access of a file in the first m equal time periods.
Set the access volume of file I in the kth time period, and establish the original access volume sequence according to the access volume of file I in the first m equal time periods at the current time, the original sequence data is represented by the upper corner mark "0".
f(0)i={f(0)i(1),f(0)i(2),f(0)i(3),…,f(0)i(m)} |
(1) Original sequence accumulation operation. Add the first and second data of the original sequence to get the second data of the accumulated sequence, add the third data and the second data to get the third data and so on. The calculation function is shown in (1).
f(1)i(k)=∑km=1f(1)i(m),(k=1,2,3,…,m) | (1) |
(2) Smoothness test and quasi exponential test. Before establishing the model, we need to test the smoothness of the original sequence and the series test of the accumulated sequence. If it meets the requirements, we can further establish the model.
For smoothness test, time-series smoothness ratio is defined as follows:
ρ(k)=f(0)(k)f(1)(k−1),k=3,4,…,m | (2) |
If f(1)(k)meetsρ(k+1)ρ(k)<1, ρ(k) ∈[0, ϵ], ϵ<0.5,k=2,3,…,m, f(1)(k) is smooth data sequence.
For the quasi exponential test, the class ratio is defined as follows:
σ(k)=f(0)(k)f(0)(k−1),k=3,4,…,m | (3) |
If a sequence ratio satisfies σ(k)∈[1,1.5],k=3,4,…,m, then the sequence f(1)(k) is regarded as exponential distribution.
(3) The trend of accumulation sequence can be described by the whitening differential equation shown in Eq (4).
df(1)i(m)dt+af(1)i(m)=u | (4) |
Where parameters a and u are identified, and a is called development coefficient, u is the amount of grey action.
(4) Model solving. Parameters a and u are computed by the least square method. Let Fm=[f(0)i(2),f(0)i(3),f(0)i(4),…,f(0)i(m)]T, ˆα=(au), ˆα is estimator of parameter vector.
B=[−12[f(1)i(1)+f(1)i(2)]1−12[f(1)i(2)+f(1)i(3)]1⋯⋯−12[f(1)i(n−1)+f(1)i(n)]1] | (5) |
Substituting ˆα into the Eq (6), compute the value Fm[A1].
Fm=Bˆα | (6) |
(5) The whitening differential equation is solved, and Eq (4) is solved to obtain the unbiased grey prediction discrete-time response function, which is shown in formula (7).
ˆf(1)i(k+1)=[f(0)i(1)−ua]e−at+uak=0,1,2,…,m−1 | (7) |
(6) The parameters b and A are calculated according to the estimated values a and u in grey model, and eliminate the inherent deviation. The calculation method is shown in Eq (8).
b=ln2−a2+a,A=2u2+a | (8) |
(7) The unbiased grey prediction model of original data is shown in Eq (9).
ˆf(0)1=f(0)1,ˆf(0)k+1=Aebk,k=1,2,3,…,m | (9) |
By updating the parameters, the unbiased grey prediction model does not need to reduce and restore the prediction data, which simplifies the modeling steps and improves the operation efficiency of the system.
In the Markov model, the error can be corrected by selecting the center value of the state interval where the grey prediction value is located. However, only considering the center value of the state interval may ignore the influence of other error distribution information, and the center value of the state interval cannot properly represent the error state. Therefore, with the help of cuckoo search algorithm, the grey prediction value of Markov model can be modified more accurately.
The partition of state interval is determined by relative precision, which is the difference between actual value and predicted value. Each data is distinguished according to its own relative precision. To adapt to the prediction of different data series, the boundary value of the interval between the first and the last states is selected as the maximum value. The target interval is divided equally into 3–5.
Suppose [ai,bi] is the ith state interval, the objective function makes the error of prediction correction vi as small as possible, and it is defined as follows (10).
vi=αiai+(1−αi)bi,i=1,2,…,n | (10) |
The better parameter αi of each interval is found by Cuckoo search algorithm. When α = 0.5, the correction value is the center value of the state interval.
The specific process of cuckoo search algorithm to optimize Grey-Markov model is as follows.
(1) Location update
In CS algorithm, cuckoos find the nest and lay eggs by the Lévy flight search. The invincible walk is a walk between long and short steps. Therefore, the nest updating position is calculated by Lévy flight as follows:
xt+1i=xti+β⨂Lévy(λ,s) | (11) |
s=u|v|1λ |
Lévy(λ,s)=λΓ(λ)sin(πλ2)πs1+λ,λ∈(1,2) |
Where t is the number of iterations and xti is the position of the ith point at time t and β is the step, u∼N(0,σ2u),v∼N(0,σ2v),σ2u=(Γ(1+λ)sin(πλ2)Γ[2(λ−1)2∙λ(1+λ)2]),σv=1,Γ(z)=∫∞0tz−1etdt, for |s| ≥ |s0|, s submits to Lévy distribution, s0 is the min step, s0≫0, but in fact, if s0 is reasonable, it is feasible, for example s0∈(0.1,1).
The host bird rebuilds the nest after discovering the alien bird with a certain probability pa, because the choice of direction is random, the probability of each direction is the same, obeying the uniform distribution. A relatively small value is selected by a large probability in the Lévy distribution, so this location can be approximately satisfied with the Mantegna algorithm. Therefore, the position of the new bird's nest is defined as follows:
xt+1i=xti+αs⨂H(pa−ϵ)⨂(xtj−xtk) | (12) |
Among them, t is the number of iterations and xti is the ith position at time t, α is the step, s, ϵ are random numbers that obey even distribution, H is the Heaviside function (13) derived from even distribution, ⨂ is the point multiplication, xtj and xtk is the location of two bird nests randomly selected at t time.
Heaviside(x)={1x≥00x<0 | (13) |
(2) Definition of Cuckoo search objective function
The objective function is represented by the mean absolute value of relative percentage error (MAPE). The state interval is divided according to the residual between the grey prediction value and the actual value. Each generation is determined by the minimum MAPE value; the prediction value is modified according to the residual correction after each iteration. The objective function is defined as follows (14).
MAPE=1n∑ni=1(|f(0)i(k)−ˆF(0)i(k)|f(0)i(k)) | (14) |
Where ˆF(0)i(k) is the prediction correction value.
(3) Calculation of correction value
The correction value of each state interval is to calculate by the optimal decision interval coefficient α=(α1,α2,…,αn) in the first step. According to formula (15), the correction values v1,v2,…,vn are obtained.
{v1=α1a1+(1−α1)b1v2=α2a2+(1−α2)b2v3=α3a3+(1−α3)b3 | (15) |
(4) Determination of predicted value
Through the k-step state transition probability matrix and the state Ei of the current data and its initial vector X(0), the state of the predicted value at the next moment can be predicted. Assuming max(Pij)=Pik, the predicted data is most likely to change from state Ei to Ek at the next moment.
Each state interval is divided according to the relative accuracy, and the unbiased grey forecast is corrected according to the relative accuracy to reduce the error and increase the accuracy of the forecast. The formula is as follows (16).
ˆF(0)(m+1)=ˆf(0)(m+1)+vi | (16) |
Among them, vi is the correction value of state i interval, ˆF(0)(m+1) represents the correction value of the final predicted value, which is the final predicted value of file access popularity.
Input: Number of host nests n; number of iterations m; probability of detection pa Output: Optimal solution Xi |
1.begin |
2. Initial population n host nests Xi (i=1,2,…,n); |
3. Calculate fitness Fi( i=1, 2, …, n); |
4. While (not true) |
5. Using Lévy flight to generate a new solution Xi; |
6. Calculate the fitness value Fi of the new solution Xi; |
7. Select candidate solution Xj, |
8. if(Fi > Fj) |
9. Fj replaces Fi; |
10. end |
11. Discard the bad solution according to the detection probability pa; |
12. Generate new nest (solution) by local random search; |
13. Preserve the best solution (Nest of high quality solutions); |
14. Find the current best Xi; |
15. end While |
16. end |
This paper predicts the data file access popularity based on the grey prediction model, and the data file is a small sample. Data access has the time locality characteristic, and the current frequently accessed files will be accessed again in a certain period of time in the future, so the older data can't reflect the current trend, and the data needs to be updated.
Files in the distributed system are being accessed all the time, and the latest file access data will be generated at any time. In order to be able to use historical popularity to predict future file popularity more accurately, we deal with the original data sequence by metabolism according to the idea of high weight of new data in Reference [17]. When updating the data sequence, add the latest file popularity to the original sequence and eliminate the old popularity data. Repeat the above operations according to the file statistical cycle to update the file popularity forecast sequence. It can better reflect the recent trend of the file's popularity, and more accurately predict the file's future popularity.
In general, the number of replicas is 3 in HDFS replica management policy, which is inefficient when a large number of files are accessed. Dynamic adjustment of replica number based on file popularity and file system availability can provide reliable and fast data access environment for users and improve efficiency in distributed file system.
In the static replica management strategy, the replica number is 3 and it is fixed value, which will not change in the whole life cycle from replica creation to destruction. However, in fact, the fixed number of copies does not meet the requirement, and it should be dynamically adjusted according to the user's access characteristics and demand changes [19]. Appropriate number of copies can not only improve the readability and availability of data, but also improve the task execution ability of MapReduce. With the dynamic change of file popularity, the copy number of high popularity files should be increased dynamically to improve the transaction processing ability of the system under concurrent access. When the file popularity is reduced, the copy number should be reduced to save storage space and maintenance cost.
The replica adjustment strategy includes replication, keeping and deletion of copies. Replication is to increase the number of copies and copy the copies of files to new nodes, in order to respond to a large number of access requests of files in time; copy keeping means that the number of copies of the document remains unchanged; copy deletion is to delete some copies of files on the premise of ensuring storage reliability when the popularity of file access decreases [22].
Because the current prediction number of replicas is lagged, we should adjust the number of replicas with the prediction of the number at the next moment, so that we can solve the cyber source competition and congestion caused by the sudden access, and reduce the task execution delay caused by the request source competition. The strategy predicts the file access heat of the next time according to the previous n access heat values in the distributed file system before the current time, so as to adjust the number of copies.
The number of copies is determined by the relationship between the predicted access heat of files and the average access heat of all files in the system. In order to ensure the data reliability of cloud storage system, each file has the minimum number of copies C, C = 2. The number of replicas for each file is defined by the formula (17).
Nadjust(i)=⌈C×H(i)Havg⌉−Nexist(i) | (17) |
Where, Nadjust(i) is the number of copies of file i, H(i) is the file popularity of file i predicted by the prediction model at the next moment, Havg is the average access popularity of all files in the distributed file system, Nexist(i) is the number of existing copies of file i.
Most researchers consider the file popularity rather than the availability of cloud computing system in the fixed number of copies in the static copy strategy. Through the quantitative analysis of system availability parameters and the availability of data blocks, the relationship function between the availability of file system and the number of copies is obtained.
(1) In the distributed file system, suppose that the file F is divided into n data blocks {b1,b2,…,bn}, which are stored on different nodes. Assuming that the data block bk has rj copies, and the rj copies are placed on m nodes (j ≥ m), only when all the copy storage nodes of the data block are unavailable, the data block is bad. Because each node in the distributed file system is independent each other, P(Bk) is the availability probability of data block k, and P(Nk) is the availability probability of data node k. Therefore, the calculation for the unavailability probability of the data block bk is shown in (18).
P(−Bk)=P(−N1)×P(−N2)×⋯×P(−Nm) | (18) |
In the cloud storage system, the reliability of nodes changes exponentially with time. In a distributed file system, it is assumed that the reliability of data nodes is an exponential distribution (19) in the life cycle T of the file.
f(T)=e−αT | (19) |
Where α is the failure rate of the node in the file system, and f(T) is the effective expected value of the node in the life cycle of the file.
(2) The availability of a file is that every data block of the file is available. The failure of any data block will lead to the unavailability of the file. The file availability calculation formula is shown in (20).
P(−F)=P(−B1∪−B2∪⋯∪−Bn) | (20) |
Each data block is independent of each other, and the formula (21) is obtained.
P(−F)=∑ni=1(−1)i+1Cin(∏rjl=1(1−e−αlTrj))i | (21) |
(3) The minimum number of copies can be obtained according to the file availability. The formula is shown in (22).
P(F)=1−P(−F)≥Pstandard | (22) |
Where Pstandard is the standard value of file availability.
The final number of copy factors is determined by integrating the copy adjustment model based on file popularity. The 4.2 can ensure the minimum value of the copy numbers under the condition of file reliability, combined with the copy numbers obtained by the file popularity decision, and under the condition of ensuring file reliability, adjust the file copy numbers to improve the system response efficiency and system storage space utilization while ensuring the reliability of the distributed file system. The specific process of the dynamic adjustment of the copy numbers is shown in Figure 1. The minimum number of copies n1 is calculated under the condition of ensuring the availability of the file, and the copy number n2 is obtained by combining the relationship between the predicted value of the file's future popularity and the average value of all files in the cluster. When the number of copies needs to be increased, the larger value is selected between n1 and n2. When we reduce the number of copies, the reduction n3 is equal to the original number of copies minus the forecast numbers based on file popularity prediction. When n3 is less than 2, the minimum number of copies is 2 by the multiple copy mechanism of the distributed file system.
In this section, we present a series of experiments to verify the algorithm innovation points in this paper. In order to achieve a real simulation of the changes in user access to files, we collect the data from real-time data of a social communication software, as shown in Table 1. The data to the user's access to a certain file (file popularity) is collected every ten minutes, which is used as the original data sequence before processing. First, accumulate the original sequence to get a new sequence, and perform smooth and exponential detection on the new sequence. Through formulae (2) and (3) and test standards, it is verified that the cumulative sequence is a smooth sequence and has an exponential distribution.
Serial number/period | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Pageview/ten thousand times | 324.01 | 366.7 | 447.19 | 370.52 | 289.23 | 382.11 | 388.01 | 381.45 |
CloudSim4.0 is used as the platform to verify the effect of the dynamic HDFS copies placement strategy based on file access popularity. The operating system is Windows 10. The development tool is Eclipse. The version of JDK is 1.8.0. In the experiment, ten simulated heterogeneous data nodes are distributed on three racks, with nodes marked as DN1 to DN10 and racks marked as Rack0, Rack1 and Rack2. One rack contains 4 nodes and the other two racks contain 3 nodes respectively. The detailed information of the nodes and racks is shown in Table 2.
DataNode | CPU | RAM/MB | Disk Space /GB | Rack | |
Basic Frequency/GHz | core | ||||
1 | 2.0 | 8 | 6144 | 1024 | 0 |
2 | 2.5 | 6 | 4096 | 512 | 0 |
3 | 2.0 | 6 | 8192 | 1024 | 0 |
4 | 2.5 | 4 | 4096 | 512 | 0 |
5 | 2.5 | 4 | 4096 | 1024 | 1 |
6 | 2.0 | 8 | 8192 | 512 | 1 |
7 | 2.0 | 6 | 6144 | 1024 | 1 |
8 | 2.5 | 4 | 4096 | 512 | 2 |
9 | 2.5 | 8 | 8192 | 1024 | 2 |
10 | 2.0 | 6 | 4096 | 512 | 2 |
The experiments are listed as follows:
• Experiment 1: State interval division verification;
• Experiment 2: Unbiased grey prediction model optimized by CS verification;
• Experiment 3: File copies number verification;
• Experiment 4: File copies adjustment efficiency verification;
• Experiment 5: Performance test and result analysis.
According to the grey model and Markov model, we have realized the prediction of future file visits to the original data sequence. Among them, a=0.0061, u=385.1279. Unbiased grey prediction model parameters b=−0.0061,A=383.9498. According to the prediction and fitting results to the original data sequence, we select the two maxima of the residuals as the boundary of the state interval, and they are equally divided into three intervals, corresponding to the three states of E1, E2 and E3. The number of states is the same as the nest number of the CS algorithm. The division of the state interval is shown in Table 3 and the residual of the unbiased grey Markov model are given respectively in Table 4.
State | Interval |
E1 | [−85.41, −34.30) |
E2 | [−34.30, 16.81) |
E3 | [16.81, 67.92) |
Serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
residual | 0 | −14.90 | 67.92 | −6.43 | −85.41 | 9.76 | 17.94 | 13.64 |
status | E2 | E2 | E3 | E2 | E1 | E2 | E3 | E2 |
(1) The parameters of CS algorithm
Considering the efficiency and accuracy of the algorithm, the parameters of CS algorithm are obtained after many experiments. The parameter settings are shown in Table 5.
Parameter | Nest capacity | Nest number | Step | Random step | Detection probability | Iterations | |
value | 3 | 25 | 0.01 | 1.5 | 0.25 | 200 |
(2) Evaluation of experimental results
The prediction model is evaluated by MAPE (mean absolute percentage error). MAPE can not only determine whether the prediction effect of the model is good, but also can be used as the basis for the comparison between the models.
(3) The model correction value
Through the CS algorithm iterative optimization, when the number of iterations is 100, it can show the expected effect. Considering the accuracy and the efficiency of the experiment, the number of iterations is 200, and α=(α1,α2,α3)=(0.9883,0.4183,0.4889), and the optimized Markov correction value v=(−84.81,−4.56,42.93).
Note: The Markov residual correction value obtained from the experiment is only suitable for the current prediction sequence. When the prediction sequence changes, it is necessary to recalculate α to achieve the optimal effect.
(4) Comparison of experimental data
Table 6 shows the comparison of experimental data based on Grey Markov model, unbiased grey Markov model and CS unbiased grey Markov model.
Sequence number | xi | Grey Markov model | Unbiased Grey Markov model | CS Unbiased Grey Markov model | ||||||
yi | ei | ri | yi | ei | ri | yi | ei | ri | ||
1 | 324.01 | 324.01 | 0.00 | 0.0000 | 324.01 | 0.00 | 0.0000 | 324.01 | 0.00 | 0.0000 |
2 | 366.70 | 381.97 | —15.27 | 0.0168 | 372.45 | −5.75 | 0.0157 | 376.94 | —10.24 | 0.0279 |
3 | 447.19 | 379.63 | 67.56 | 0.0572 | 421.64 | 25.55 | 0.0571 | 422.20 | 24.99 | 0.0559 |
4 | 370.52 | 377.31 | −6.79 | 0.0000 | 367.80 | 2.72 | 0.0073 | 372.29 | −1.77 | 0.0048 |
5 | 289.23 | 375.00 | —85.77 | 0.0883 | 314.78 | —25.55 | 0.0883 | 289.83 | −0.60 | 0.0021 |
6 | 382.11 | 372.70 | 9.41 | 0.0485 | 363.24 | 18.87 | 0.0494 | 367.69 | 14.42 | 0.0377 |
7 | 388.01 | 370.42 | 17.59 | 0.0629 | 412.07 | —24.06 | 0.0620 | 413.00 | —24.99 | 0.0644 |
8 | 381.45 | 368.16 | 13.29 | 0.0587 | 358.66 | 22.79 | 0.0597 | 363.15 | 18.30 | 0.0480 |
Prediction value | 362.89 | 356.80 | 6.09 | 0.0168 | 356.41 | 6.48 | 0.0179 | 361.00 | 1.89 | 0.0052 |
MAPE | 4.36% | 4.47% | 3.08% | |||||||
Note: The xiandyi unit is ten thousand times in Table 6. |
Let xi is the ith element of the original sequence, yi is the predicted value of xi, ei is the error, ri is the relative error. The error and the relative error can be defined as follows:
ei=yi−xii=1,2,…n |
ri=|ei|xii=1,2,……n | (23) |
Finally, MAPE is used to evaluate the fitting degree of the three models.
From Table 6, we can see that the optimization effect of the CS Unbiased Grey Markov model is superior to the Grey Markov model and the unbiased grey Markov model. the MAPE of the Grey Markov model and the Unbiased Grey Markov model are both about 4.4%, indicating that the Markov correction model has a good effect on reducing the error of the grey model. The Markov model optimized by the CS algorithm improves the error correction, and the MAPE is only 3.08%.
In order to further verify that CS unbiased grey Markov model can effectively predict the next moment data in the case of small samples. Using the same group of data, we do the experiment based on the grey model, unbiased grey model, grey Markov model, unbiased grey Markov model, BP neural network and grey neural network. The experimental results are shown in Figure 2 and Table 7.
Model | Grey model | Unbiased grey model | Grey Markov model | Unbiased grey Markov model | Grey neural network | CS Unbiased Grey Markov model |
MAPE | 7.76% | 7.75% | 4.36% | 4.47% | 3.35% | 3.08% |
From Table 7, it is concluded that the prediction accuracy of grey model and unbiased grey model is not much different, but the unbiased grey prediction model doesn't need to carry out incremental operation for the prediction of file popularity, so the calculation efficiency will be higher. In the case of small sample, the MAPE of CS unbiased grey Markov model is the smallest compared with other models, that is, the accuracy of data prediction is higher. It shows that the model can better predict the short-term data, and provide a good method for coping with future environmental changes and taking corresponding measures in advance.
Table 8 shows the comparison of MAPE values between our method and other algorithms. In the time feedback model, the prediction is more accurate in a short period of time, but the effect becomes worse in a long period of time. The model with the decay factor has a higher prediction accuracy when the change is relatively uniform, and the prediction accuracy decreases when there is a sharp change. The MAPE value of the BP neural network model is between the time feedback model and the CS unbiased grey Markov model. Our method achieves a relatively good effect.
Model | Model with decay factor | Time feedback model | BP neural network | Our method |
MAPE | 8.46% | 3.48% | 4.33% | 3.08% |
CloudSim simulation platform is used for simulation experiment, and HDFS simulation is realized by extending the distributed file system. The prediction of file popularity uses multiple clients to write 10 files into the cluster at the same time. The number of copies obtained based on file popularity is shown in Table 9. Firstly, the CS unbiased grey Markov model is used to predict the file popularity in the future, and according to the future file access heat and the availability of the file, the change of the copy is obtained. When the file access volume increases, the number of copies needs to be increased, instead, the number of copies reduced. However, to ensure the availability of files, when the number of file copies is less than the default value, select the default value directly.
File | Initial file popularity | Predict file popularity at t | Initial number of copies | Number of copies at t | Copies change |
1 | 444.01 | 690.09 | 2 | 3 | 1 |
2 | 551.56 | 499.9 | 3 | 2 | −1 |
3 | 869.73 | 489.99 | 4 | 2 | −2 |
4 | 467.32 | 1279.66 | 2 | 5 | 3 |
5 | 365.01 | 657.36 | 2 | 3 | 1 |
6 | 152.33 | 344.98 | 2 | 2 | 0 |
7 | 240.87 | 689.76 | 2 | 3 | 1 |
8 | 268.56 | 578.66 | 2 | 3 | 1 |
9 | 446.34 | 267.99 | 2 | 2 | 0 |
10 | 604.56 | 200.98 | 3 | 2 | −1 |
The experiment shows that when the file popularity increases, such as file 4, the number of copies increases; when the access popularity decreases, the number of files decreases, such as file 3. In order to ensure the availability of file copies, when the number of file copies are less than the default value, the number of copies does not change, such as file 9.
We verify the performance of the system after replica factor adjustment by setting different task quantities, executing tasks for 10 times and recording the average completion time with different task quantities. As is shown in Figure 3.
Through the analysis of the experimental results in Figure 3, the adjustment of the hot file replica factor has improved the system computing efficiency. The multi-factor node evaluation algorithm (MFNEA) is used to optimize the dynamic placement strategy of the new replica factor, which can promote the overall computing efficiency of the system, make full use of the computing power of the entire cluster node, and then effectively improve the work execution efficiency of MapReduce.
In the simulation experiment, 600 files were simultaneously written to the cluster by multiple clients, and the default copy placement strategy was adopted for initial writing of files. When the system runs for a period of time, the access popularity of files changes. The default copy placement strategy is compared with the copy placement strategy with MFNEA. The storage resource usage of each node in the cluster is shown in Figure 4.
The "default method" represents the default copy placement strategy in the distributed file system, and Balancer is used to adjust the copy distribution. Based on the CPU load, memory load and disk load, MFNEA determines the node performance with the method of analytic hierarchy process. MFNEA evaluates the node in a linear weighted way, with the copy placement node selected according to node evaluation value. It can be seen from the figure that the disk space usage of each node in the default copy placement strategy varies greatly. The highest percent of node usage is 81%, while some nodes are only about 30%, which is relatively idle, causing serious imbalanced copy distribution. The MFNEA balances the storage space utilization of each node in the cluster, with the node utilization within 20%.
The number of copies will directly affect whether it can provide users with a reliable and fast data access environment in the copy management strategy. Because the file popularity changes dynamically in the system, the static copy decision cannot adapt to the changing demand of file access. The paper designs a file copy dynamic adjustment strategy. Firstly, the recent access characteristics of the file are obtained, the unbiased grey model is used to predict the file access popularity at the next moment, and the Markov model optimized by the cuckoo search algorithm is used to correct the predicted value to increase the accuracy of the predicted value. The number of copies is calculated based on file popularity and file availability, with the number of copies reduced for low-popularity files, which improves the storage space utilization in the distributed file system, and accordingly the number of copies increased for high-popularity files to improve system performance and system reliability.
This work was supported by Hebei Urban Forest Health Technology Innovation Center, Hebei Key Laboratory of Agricultural Big Data, National Natural Science Foundation of China (No. 62102130), Central Government Guides Local Science and Technology Development Fund Project (No. 226Z0201G), Precision Animal Husbandry Discipline Group Construction Project of Hebei Agricultural University (No. 1090064), Natural Science Foundation of Hebei Province (No. F2020204003), Science and Technology Research Foundation of Hebei Province (No. ZD2016158), Key Research and Development Program of Hebei Province (No. 19220119D), Hebei Youth Talents Support Project (No. BJ2019008), Research Project of Hebei Provincial Department of Human Resources and Social Security (No. JRS-2021-5024), Introducing Talent Research Project of Hebei Agricultural University (No. YJ201829).
The authors declare there is no conflict of interest.
[1] |
W. Weibull, A Statistical distribution function of wide applicability, J. Appl. Mech., 18 (1951), 293-297. https://doi.org/10.1115/1.4010337 doi: 10.1115/1.4010337
![]() |
[2] |
M. Xie, Y. Tang, T. N. Goh, A modified Weibull extension with bathtub-shaped failure rate function, Reliab. Eng. Syst. Saf., 76 (2002), 279-285. https://doi.org/10.1016/S0951-8320(02)00022-4 doi: 10.1016/S0951-8320(02)00022-4
![]() |
[3] |
C. D. Lai, M. Xie, D. N. P. Murthy, A modified Weibull distribution, IEEE Trans. Reliab., 52 (2003), 33-37. https://doi.org/10.1109/TR.2002.805788 doi: 10.1109/TR.2002.805788
![]() |
[4] |
M. Bebbington, C. D. Lai, R. Zitikis, A flexible Weibull extension, Reliab. Eng. Syst. Saf., 92 (2007), 719-726. https://doi.org/10.1016/j.ress.2006.03.004 doi: 10.1016/j.ress.2006.03.004
![]() |
[5] |
M. Carrasco, E. M. Ortega, G. A. Paula, Log-modified Weibull regression models with censored data: Sensitivity and residual analysis, Comput. Stat. Data Anal., 52 (2008), 4021-4039. https://doi.org/10.1016/j.csda.2008.01.027 doi: 10.1016/j.csda.2008.01.027
![]() |
[6] |
M. Carrasco, E. M. Ortega, G. M. Cordeiro, A generalized modified Weibull distribution for lifetime modeling, Comput. Stat. Data Anal., 53 (2008), 450-462. https://doi.org/10.1016/j.csda.2008.08.023 doi: 10.1016/j.csda.2008.08.023
![]() |
[7] |
G. O. Silva, E. M. Ortega, G. M. Cordeiro, The beta modified Weibull distribution, Lifetime Data Anal., 16 (2010), 409-430. https://doi.org/10.1007/s10985-010-9161-1 doi: 10.1007/s10985-010-9161-1
![]() |
[8] |
A. M. Sarhan, J. Apaloo, Exponentiated modified Weibull extension distribution, Reliab. Eng. Syst. Saf., 112 (2013), 137-144. https://doi.org/10.1016/j.ress.2012.10.013 doi: 10.1016/j.ress.2012.10.013
![]() |
[9] |
S. J. Almalki, J. Yuan, A new modified Weibull distribution, Reliab. Eng. Syst. Saf., 111 (2013), 164-170. https://doi.org/10.1016/j.ress.2012.10.018 doi: 10.1016/j.ress.2012.10.018
![]() |
[10] |
G. M. Cordeiro, E. M. Ortega, G. O. Silva, The Kumaraswamy modified Weibull distribution: theory and applications, J. Stat. Comput. Simul., 84 (2014), 1387-1411. https://doi.org/10.1080/00949655.2012.745125 doi: 10.1080/00949655.2012.745125
![]() |
[11] |
B. He, W. Cui, X. Du, An additive modified Weibull distribution, Reliab. Eng. Syst. Saf., 145 (2016), 28-37. https://doi.org/10.1016/j.ress.2015.08.010 doi: 10.1016/j.ress.2015.08.010
![]() |
[12] |
S. F. Bagheri, E. B. Samani, M. Ganjali, The generalized modified Weibull power series distribution: Theory and applications, Comput. Stat. Data Anal., 94 (2016), 136-160. https://doi.org/10.1016/j.csda.2015.08.008 doi: 10.1016/j.csda.2015.08.008
![]() |
[13] |
A. A. Ahmad, M. G. Ghazal, Exponentiated additive Weibull distribution, Reliab. Eng. Syst. Saf., 193 (2020), 106663. https://doi.org/10.1016/j.ress.2019.106663 doi: 10.1016/j.ress.2019.106663
![]() |
[14] |
A. Z. Afify, M. Nassar, G. M. Cordeiro, D. Kumar, The Weibull Marshall-Olkin Lindley distribution: properties and estimation, J. Taib. Univ. Sci., 14 (2020), 192-204. https://doi.org/10.1080/16583655.2020.1715017 doi: 10.1080/16583655.2020.1715017
![]() |
[15] |
S. J. Almalki, A reduced new modified Weibull distribution, Comm. Stat. Theory Meth., 47 (2018), 2297-2313. https://doi.org/10.1080/03610926.2013.857416 doi: 10.1080/03610926.2013.857416
![]() |
[16] |
M. K. Shakhatreh, A. J. Lemonte, G. M. Arenas, The log-normal modified Weibull distribution and its reliability implications, Reliab. Eng. Syst. Saf., 188 (2019), 6-22. https://doi.org/10.1016/j.ress.2019.03.014 doi: 10.1016/j.ress.2019.03.014
![]() |
[17] | W. Q. Meeker, L. A. Escobar, Statistical Methods for Reliability Data, John Wiley, New York, 1998. |
[18] |
A. M. Abouammoh, S. A. Abdulghani, I. S. Qamber, On partial orderings and testing of new better than renewal used classes, Reliab. Eng. Syst. Saf., 43 (1994), 37-41. https://doi.org/10.1016/0951-8320(94)90094-9 doi: 10.1016/0951-8320(94)90094-9
![]() |
[19] | T. Zhang, R. Dwight, K. El-Akruti, On a Weibull related distribution model with decreasing, increasing and upside-down bathtub-shaped failure rate, in Reliability Maintainability Symposium, 43 (2013), 28-31. https: //doi.org/10.1109/RAMS.2013.6517749 |
[20] |
K. Xu, M. Xie, L. C. Tang, S. L. Ho, Application of neural networks in forecasting engine systems reliability, Appl. Soft Comput., 2 (2003), 255-268. https://doi.org/10.1016/S1568-4946(02)00059-5 doi: 10.1016/S1568-4946(02)00059-5
![]() |
[21] |
M. V. Aarset, How to identify bathtub hazard rate, IEEE Trans. Reliab., 36 (1987), 106-108. https://doi.org/10.1109/TR.1987.5222310 doi: 10.1109/TR.1987.5222310
![]() |
[22] |
R. D. Gupta, D. Kundu, Exponentiated exponential family: An alternative to gamma and Weibull distributions, Biom. J., 43 (2001), 117-130. https://doi.org/10.1002/1521-4036(200102)43:1<117::AID-BIMJ117>3.0.CO;2-R doi: 10.1002/1521-4036(200102)43:1<117::AID-BIMJ117>3.0.CO;2-R
![]() |
[23] |
D. Kundu, M. Z. Raqab, Generalized Rayleigh distribution: Different methods of estimations, Comput. Stat. Data Anal., 49 (2005), 187-200. https://doi.org/10.1016/j.csda.2004.05.008 doi: 10.1016/j.csda.2004.05.008
![]() |
[24] |
M. E. Ghitany, D. K. Al-Mutairia, N. Balakrishnan, L. J. Al-Enezia, Power Lindley distribution and associated inference, Comput. Stat. Data Anal., 64 (2013), 20-33. https://doi.org/10.1016/j.csda.2013.02.026 doi: 10.1016/j.csda.2013.02.026
![]() |
[25] |
A. J. Lemonte, A new exponential-type distribution with constant, decreasing, increasing, upside-down bathtub and bathtub-shaped failure rate function, Comput. Stat. Data Anal., 62 (2013), 149-170. https://doi.org/10.1016/j.csda.2013.01.011 doi: 10.1016/j.csda.2013.01.011
![]() |
[26] |
H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Cont., 19 (1974), 716-723. https://doi.org/10.1109/TAC.1974.1100705 doi: 10.1109/TAC.1974.1100705
![]() |
[27] |
C. M. Hurvich, C. L. Tsai, Regression and time series model selection in small samples, Biometrika, 76 (1989), 297-307. https://doi.org/10.1093/biomet/76.2.297 doi: 10.1093/biomet/76.2.297
![]() |
[28] |
G. Schwarz, Estimating the Dimension of a Model, Ann. Stat., 6 (1978), 461-464. https://doi.org/10.1214/aos/1176344136 doi: 10.1214/aos/1176344136
![]() |
[29] |
E. J. Hannan, B. G. Quinn, The determination of the order of an autoregression, J. R. Stat. Soc. Ser. B, 41 (1979), 190-195. https://doi.org/10.1111/j.2517-6161.1979.tb01072.x doi: 10.1111/j.2517-6161.1979.tb01072.x
![]() |
[30] |
G. Chen, N. Balakrishnan, A general purpose approximate goodness-of-fit test, J. Qual. Tech., 27 (1995), 154-161. https://doi.org/10.1080/00224065.1995.11979578 doi: 10.1080/00224065.1995.11979578
![]() |
[31] |
B. Bergman, B. Klefsjo, The total time on test concept and its use in reliability theory, Oper. Res., 32 (1984), 596-606. https://doi.org/10.1287/opre.32.3.596 doi: 10.1287/opre.32.3.596
![]() |
1. | Marco Kemmerling, Daniel Lütticke, Robert H. Schmitt, Beyond games: a systematic review of neural Monte Carlo tree search applications, 2024, 54, 0924-669X, 1020, 10.1007/s10489-023-05240-w | |
2. | weijin mao, Wenzhen Wu, Pierluigi Siano, Hazlie Mokhlis, 2024, A short-term prediction model for photovoltaic power forecasting based on CEEMDAN-CS-LSTM, 9781510679795, 394, 10.1117/12.3024734 | |
3. | Yixin Zhuo, Ling Li, Jian Tang, Wenchuan Meng, Zhanhong Huang, Kui Huang, Jiaqiu Hu, Yiming Qin, Houjian Zhan, Zhencheng Liang, Optimal real-time power dispatch of power grid with wind energy forecasting under extreme weather, 2023, 20, 1551-0018, 14353, 10.3934/mbe.2023642 |
Serial number/period | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Pageview/ten thousand times | 324.01 | 366.7 | 447.19 | 370.52 | 289.23 | 382.11 | 388.01 | 381.45 |
DataNode | CPU | RAM/MB | Disk Space /GB | Rack | |
Basic Frequency/GHz | core | ||||
1 | 2.0 | 8 | 6144 | 1024 | 0 |
2 | 2.5 | 6 | 4096 | 512 | 0 |
3 | 2.0 | 6 | 8192 | 1024 | 0 |
4 | 2.5 | 4 | 4096 | 512 | 0 |
5 | 2.5 | 4 | 4096 | 1024 | 1 |
6 | 2.0 | 8 | 8192 | 512 | 1 |
7 | 2.0 | 6 | 6144 | 1024 | 1 |
8 | 2.5 | 4 | 4096 | 512 | 2 |
9 | 2.5 | 8 | 8192 | 1024 | 2 |
10 | 2.0 | 6 | 4096 | 512 | 2 |
State | Interval |
E1 | [−85.41, −34.30) |
E2 | [−34.30, 16.81) |
E3 | [16.81, 67.92) |
Serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
residual | 0 | −14.90 | 67.92 | −6.43 | −85.41 | 9.76 | 17.94 | 13.64 |
status | E2 | E2 | E3 | E2 | E1 | E2 | E3 | E2 |
Parameter | Nest capacity | Nest number | Step | Random step | Detection probability | Iterations | |
value | 3 | 25 | 0.01 | 1.5 | 0.25 | 200 |
Sequence number | xi | Grey Markov model | Unbiased Grey Markov model | CS Unbiased Grey Markov model | ||||||
yi | ei | ri | yi | ei | ri | yi | ei | ri | ||
1 | 324.01 | 324.01 | 0.00 | 0.0000 | 324.01 | 0.00 | 0.0000 | 324.01 | 0.00 | 0.0000 |
2 | 366.70 | 381.97 | —15.27 | 0.0168 | 372.45 | −5.75 | 0.0157 | 376.94 | —10.24 | 0.0279 |
3 | 447.19 | 379.63 | 67.56 | 0.0572 | 421.64 | 25.55 | 0.0571 | 422.20 | 24.99 | 0.0559 |
4 | 370.52 | 377.31 | −6.79 | 0.0000 | 367.80 | 2.72 | 0.0073 | 372.29 | −1.77 | 0.0048 |
5 | 289.23 | 375.00 | —85.77 | 0.0883 | 314.78 | —25.55 | 0.0883 | 289.83 | −0.60 | 0.0021 |
6 | 382.11 | 372.70 | 9.41 | 0.0485 | 363.24 | 18.87 | 0.0494 | 367.69 | 14.42 | 0.0377 |
7 | 388.01 | 370.42 | 17.59 | 0.0629 | 412.07 | —24.06 | 0.0620 | 413.00 | —24.99 | 0.0644 |
8 | 381.45 | 368.16 | 13.29 | 0.0587 | 358.66 | 22.79 | 0.0597 | 363.15 | 18.30 | 0.0480 |
Prediction value | 362.89 | 356.80 | 6.09 | 0.0168 | 356.41 | 6.48 | 0.0179 | 361.00 | 1.89 | 0.0052 |
MAPE | 4.36% | 4.47% | 3.08% | |||||||
Note: The xiandyi unit is ten thousand times in Table 6. |
Model | Grey model | Unbiased grey model | Grey Markov model | Unbiased grey Markov model | Grey neural network | CS Unbiased Grey Markov model |
MAPE | 7.76% | 7.75% | 4.36% | 4.47% | 3.35% | 3.08% |
Model | Model with decay factor | Time feedback model | BP neural network | Our method |
MAPE | 8.46% | 3.48% | 4.33% | 3.08% |
File | Initial file popularity | Predict file popularity at t | Initial number of copies | Number of copies at t | Copies change |
1 | 444.01 | 690.09 | 2 | 3 | 1 |
2 | 551.56 | 499.9 | 3 | 2 | −1 |
3 | 869.73 | 489.99 | 4 | 2 | −2 |
4 | 467.32 | 1279.66 | 2 | 5 | 3 |
5 | 365.01 | 657.36 | 2 | 3 | 1 |
6 | 152.33 | 344.98 | 2 | 2 | 0 |
7 | 240.87 | 689.76 | 2 | 3 | 1 |
8 | 268.56 | 578.66 | 2 | 3 | 1 |
9 | 446.34 | 267.99 | 2 | 2 | 0 |
10 | 604.56 | 200.98 | 3 | 2 | −1 |
Serial number/period | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Pageview/ten thousand times | 324.01 | 366.7 | 447.19 | 370.52 | 289.23 | 382.11 | 388.01 | 381.45 |
DataNode | CPU | RAM/MB | Disk Space /GB | Rack | |
Basic Frequency/GHz | core | ||||
1 | 2.0 | 8 | 6144 | 1024 | 0 |
2 | 2.5 | 6 | 4096 | 512 | 0 |
3 | 2.0 | 6 | 8192 | 1024 | 0 |
4 | 2.5 | 4 | 4096 | 512 | 0 |
5 | 2.5 | 4 | 4096 | 1024 | 1 |
6 | 2.0 | 8 | 8192 | 512 | 1 |
7 | 2.0 | 6 | 6144 | 1024 | 1 |
8 | 2.5 | 4 | 4096 | 512 | 2 |
9 | 2.5 | 8 | 8192 | 1024 | 2 |
10 | 2.0 | 6 | 4096 | 512 | 2 |
State | Interval |
E1 | [−85.41, −34.30) |
E2 | [−34.30, 16.81) |
E3 | [16.81, 67.92) |
Serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
residual | 0 | −14.90 | 67.92 | −6.43 | −85.41 | 9.76 | 17.94 | 13.64 |
status | E2 | E2 | E3 | E2 | E1 | E2 | E3 | E2 |
Parameter | Nest capacity | Nest number | Step | Random step | Detection probability | Iterations | |
value | 3 | 25 | 0.01 | 1.5 | 0.25 | 200 |
Sequence number | xi | Grey Markov model | Unbiased Grey Markov model | CS Unbiased Grey Markov model | ||||||
yi | ei | ri | yi | ei | ri | yi | ei | ri | ||
1 | 324.01 | 324.01 | 0.00 | 0.0000 | 324.01 | 0.00 | 0.0000 | 324.01 | 0.00 | 0.0000 |
2 | 366.70 | 381.97 | —15.27 | 0.0168 | 372.45 | −5.75 | 0.0157 | 376.94 | —10.24 | 0.0279 |
3 | 447.19 | 379.63 | 67.56 | 0.0572 | 421.64 | 25.55 | 0.0571 | 422.20 | 24.99 | 0.0559 |
4 | 370.52 | 377.31 | −6.79 | 0.0000 | 367.80 | 2.72 | 0.0073 | 372.29 | −1.77 | 0.0048 |
5 | 289.23 | 375.00 | —85.77 | 0.0883 | 314.78 | —25.55 | 0.0883 | 289.83 | −0.60 | 0.0021 |
6 | 382.11 | 372.70 | 9.41 | 0.0485 | 363.24 | 18.87 | 0.0494 | 367.69 | 14.42 | 0.0377 |
7 | 388.01 | 370.42 | 17.59 | 0.0629 | 412.07 | —24.06 | 0.0620 | 413.00 | —24.99 | 0.0644 |
8 | 381.45 | 368.16 | 13.29 | 0.0587 | 358.66 | 22.79 | 0.0597 | 363.15 | 18.30 | 0.0480 |
Prediction value | 362.89 | 356.80 | 6.09 | 0.0168 | 356.41 | 6.48 | 0.0179 | 361.00 | 1.89 | 0.0052 |
MAPE | 4.36% | 4.47% | 3.08% | |||||||
Note: The xiandyi unit is ten thousand times in Table 6. |
Model | Grey model | Unbiased grey model | Grey Markov model | Unbiased grey Markov model | Grey neural network | CS Unbiased Grey Markov model |
MAPE | 7.76% | 7.75% | 4.36% | 4.47% | 3.35% | 3.08% |
Model | Model with decay factor | Time feedback model | BP neural network | Our method |
MAPE | 8.46% | 3.48% | 4.33% | 3.08% |
File | Initial file popularity | Predict file popularity at t | Initial number of copies | Number of copies at t | Copies change |
1 | 444.01 | 690.09 | 2 | 3 | 1 |
2 | 551.56 | 499.9 | 3 | 2 | −1 |
3 | 869.73 | 489.99 | 4 | 2 | −2 |
4 | 467.32 | 1279.66 | 2 | 5 | 3 |
5 | 365.01 | 657.36 | 2 | 3 | 1 |
6 | 152.33 | 344.98 | 2 | 2 | 0 |
7 | 240.87 | 689.76 | 2 | 3 | 1 |
8 | 268.56 | 578.66 | 2 | 3 | 1 |
9 | 446.34 | 267.99 | 2 | 2 | 0 |
10 | 604.56 | 200.98 | 3 | 2 | −1 |