
Data mining applications are becoming increasingly important for a wide range of manufacturing and maintenance processes. During daily operations, large amounts of data are generated. This large volume and variety of data, arriving at ever greater velocity, has both advantages and disadvantages. On the negative side, the abundance of data often impedes the ability to extract useful knowledge. In addition, the large amounts of data stored in often unconnected databases make it impractical to analyse them manually for valuable decision-making information. However, the advent of a new generation of big data analytical tools has started to provide large-scale benefits for organizations. The paper examines the possible data inputs from machines, people and organizations that can be analysed for maintenance. Further, it explains the role of big data within maintenance and how, if not managed correctly, big data can create problems rather than provide solutions. The paper highlights the need for advanced mining techniques that convert data into information in an acceptable time frame, and for modern analytical tools that extract value from big datasets.
Citation: Pankaj Sharma, David Baglee, Jaime Campos, Erkki Jantunen. Big data collection and analysis for manufacturing organisations[J]. Big Data and Information Analytics, 2017, 2(2): 127-139. doi: 10.3934/bdia.2017002
Alzheimer's disease (AD) is a degenerative neurological disease. It is the most common form of dementia, and its major feature is memory decline [1]. Drugs can only delay the deterioration of AD but cannot cure it. Hematological examination, neurological tests, and imaging techniques are usually combined in various ways to diagnose AD. Some functional imaging techniques, such as functional magnetic resonance imaging (fMRI) [2,3], positron emission tomography (PET) [4], and single photon emission computed tomography (SPECT), are useful for objectively evaluating the severity of dementia. However, disadvantages of these techniques, such as high cost and potential exposure to radionuclide irradiation, could limit their clinical application [5,6].
EEG collection equipment is more economical and portable than the imaging techniques used in AD diagnosis, and it is non-invasive. Moreover, EEG recordings can detect abnormalities in the electrical activity of the brain in AD patients [7]. Over the past 40 years, many studies have demonstrated alterations of EEG complexity, synchrony, and brain dynamics (the slowing of the alpha rhythm and the diffuse dominance of theta or delta rhythms) in AD [7,8]. To characterize these alterations, researchers have proposed many different features. Relative band power [9], absolute band power [10], spectral features and central tendency [11], mean, variance, and zero-crossing [12], auto mutual information and mean frequency [11], and amplitude modulation [13] are the most frequently extracted EEG features for AD detection. Temporal-scale-specific fractal dimension [14] and detrended cross-correlation analysis (DCCA) coefficients [15] are also useful for differentiating AD patients from healthy individuals. Combined with these features, classifiers [16,17] such as artificial neural networks (ANN) [18] and support vector machines [12] have recently been introduced to identify AD.
EEG signals are typical nonlinear time series [19]. A key measure of information is known as entropy, which has a strong relationship with nonlinear time series and dynamical systems. Entropy is defined as a measure of uncertainty of information in a statistical description of a system [20]. Permutation entropy [21], Approximate Entropy (ApEn) [22], Sample entropy (SampEn) [11,23], Spectral entropy [24], Fuzzy entropy [25] etc. are widely used in nonlinear dynamics and AD detection. Within the entropy family, approximate entropy and its modified methods have been introduced for studying regularity and complexity in physiological and biological time-series [26]. ApEn quantifies the conditional probability that two sequences which are similar for m points (within a given tolerance of r) remain similar when one consecutive point is included. SampEn is an improved algorithm of ApEn which avoids the bias caused by self-matching [22,26]. SampEn has been applied to EEG data to reveal a loss of complexity and a destruction of nonlinear structures in brain dynamics in AD [25].
The surrogate data method is a useful technique for testing hypotheses of nonlinearity in time series analysis. Many studies have demonstrated the nonlinearity of EEG sequences by surrogate data analysis [27]. Nonlinear measures, such as sample entropy, correlation dimension, and the largest Lyapunov exponent, were computed on reconstructed EEG signals. Nonparametric statistical tests were then performed on the surrogate data to verify that the nonlinear measures are an intrinsic characteristic of the signals [28]. Moreover, the interpretation of original data always involves human judgment, and the surrogate data method can provide a benchmark or control experiment against which the original data can be compared [29]. A method combining generalized sample entropy and surrogate data analysis for complex system analysis was proposed by Silva and Murta Jr. [27]. They analyzed heart rate variability (HRV) dynamics and calculated the generalized sample entropy of the original time series and of the surrogate ones. This method has also been used to analyze financial time series [30], stock market data [31] and traffic signals [32].
Inspired by this method, we propose a method which combines classical SampEn with surrogate data, and this method is used for the first time to analyze the differences between healthy subjects and AD patients. We introduce three algorithms for generating surrogate data: simple shuffling of the original time series, the un-windowed Fourier transform (FT) algorithm, and the amplitude adjusted Fourier transform (AAFT) algorithm [33]. SampEn, as a complexity measure, was investigated and tested on EEG signals. Surrogate data were used to compute entropy differences between the original dynamics and the surrogate series.
The outline of the paper is as follows. In section 2, we give an overview of surrogate data and SampEn, and describe the analysis method to detect the difference of EEG data between AD patients and normal subjects. In section 3, the results of the method and corresponding explanations are presented. In section 4, a conclusion is drawn.
The database used in this study consisted of 34 subjects (20 healthy subjects and 14 patients with a diagnosis of AD). EEG signals were recorded with the subjects in a relaxed state with eyes closed, with an average recording time of 130 seconds and a sampling frequency of 250 Hz. As shown in Figure 1, the o1 and o2 channels were placed on the occipital region, p3 and p4 on the parietal region, t3 and t4 on the temporal region, c3 and c4 on the central region, and f3 and f4 on the frontal region [34]. This database has also been used in other relevant studies [20,34,35].
SampEn is designed to reduce the bias of ApEn and is in closer agreement with theory for datasets with known probabilistic content. Moreover, SampEn displays relative consistency in situations where ApEn does not. An increase of SampEn is often associated with an increase of complexity.
The calculation of sample entropy is as follows:
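Arrange $x(1), x(2), \ldots, x(N)$ into $m$-dimensional vectors:
$$X_m(i) = [x(i), x(i+1), \ldots, x(i+m-1)], \quad 1 \le i \le N-m+1 \tag{2.1}$$
Define $d[X_m(i), X_m(j)]$ as the largest absolute difference between corresponding components of $X_m(i)$ and $X_m(j)$:
$$d[X_m(i), X_m(j)] = \max_{k} |x(i+k) - x(j+k)| \tag{2.2}$$
where $0 \le k \le m-1$, $1 \le i, j \le N-m+1$, $i \ne j$.
Given a threshold value $r$ ($r > 0$), for $1 \le i \le N-m$, $i \ne j$, define $B_i^m(r)$ as:
$$B_i^m(r) = \frac{1}{N-m-1} \, \mathrm{num}\{ d[X_m(i), X_m(j)] < r \} \tag{2.3}$$
where $\mathrm{num}\{\cdot\}$ counts the vectors $X_m(j)$ that satisfy the condition.
Averaging over all $i$ gives:
$$B^m(r) = \frac{1}{N-m-1} \sum_{i=1}^{N-m} B_i^m(r) \tag{2.4}$$
For dimension $m+1$, we have
$$A_i^m(r) = \frac{1}{N-m-1} \, \mathrm{num}\{ d[X_{m+1}(i), X_{m+1}(j)] < r \} \tag{2.5}$$
where $1 \le i \le N-m$, $i \ne j$.
The average over all $i$ is:
$$A^m(r) = \frac{1}{N-m-1} \sum_{i=1}^{N-m} A_i^m(r) \tag{2.6}$$
Sample entropy can then be calculated as:
$$\mathrm{SampEn}(m, r) = \lim_{N \to \infty} \left\{ -\ln\left[ \frac{A^m(r)}{B^m(r)} \right] \right\} \tag{2.7}$$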
Computation of SampEn depends on three parameters: the length of the epoch (N), the number of previous values used to predict the subsequent value (m), and the threshold that determines the similarity of patterns (r). The threshold r is defined as a fraction of the standard deviation (SD) of the N amplitude values [36].
B is the probability that two sequences are similar for m points, and A is the probability that two sequences are similar for m+1 points. The conditional probability is therefore CP=A/B, and SampEn(m, r, N) is precisely the negative natural logarithm of CP: the conditional probability that a dataset of length N, having repeated itself within a tolerance r for m points, will also repeat itself for m+1 points, without allowing self-matches. SampEn does not use a template-wise approach; A and B accrue over all the templates [36].
Following other studies and theoretical considerations, the parameters m=2 and r=0.20×SD are used in this study.
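The calculation above can be sketched for a finite series as follows. This is a minimal illustration, not the implementation used in the study: the function name `sample_entropy`, the use of the population SD for r, and the return value of infinity when no matches are found are all assumptions. Because the normalising factors in (2.3) to (2.6) are identical for A and B, they cancel in the ratio, so the sketch only counts matching template pairs.

```python
import math
import random


def sample_entropy(x, m=2, r_factor=0.2):
    """Estimate SampEn(m, r, N) with r = r_factor * SD of the series."""
    n = len(x)
    mean = sum(x) / n
    # Population SD is an assumption; the text only says "SD".
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    r = r_factor * sd

    def count_matches(length):
        # Count template pairs (i < j) whose largest component-wise
        # absolute difference is below r. Both lengths use the same
        # N - m templates, and i != j, so self-matches are excluded.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if all(abs(x[i + k] - x[j + k]) < r for k in range(length)):
                    count += 1
        return count

    b = count_matches(m)      # pairs similar for m points
    a = count_matches(m + 1)  # pairs still similar for m + 1 points
    if a == 0 or b == 0:
        return math.inf  # undefined for this finite sample (assumption)
    return -math.log(a / b)


# A regular signal should score lower than white noise.
rng = random.Random(0)
sine = [math.sin(2 * math.pi * i / 25) for i in range(300)]
noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
print(sample_entropy(sine), sample_entropy(noise))
```

The double loop is quadratic in N, which is fine for a short illustration but slow for series of roughly 32,720 samples; a vectorised or tree-based implementation would be preferable in practice.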
In the surrogate data method, a null hypothesis is first proposed (for example, assuming that the original data is linear), and then, surrogate data is generated by different algorithms such as FT or simply shuffling the original time series. Different surrogate data retain different characteristics of original data.
The first algorithm we will use is simply shuffling the time-order of the original time series. The surrogate data is obviously guaranteed to have the same amplitude distribution as the original data, but any temporal correlations that may exist in the original data are destroyed.
The surrogate data generated by the FT algorithm are constructed to have the same Fourier spectrum as the original data. The Fourier transform has a complex amplitude at each frequency. To randomize the phases, we multiply each complex amplitude by $e^{i\phi}$, where $\phi$ is chosen independently for each frequency from the interval $[0, 2\pi]$. We must ensure that $\phi(-f) = -\phi(f)$, so that the inverse Fourier transform is real (no imaginary components). The inverse Fourier transform is then the surrogate data [29]. For the AAFT algorithm, the idea is first to rescale the values of the original time series so that they are Gaussian. Then the FT algorithm or the windowed Fourier transform (WFT) algorithm can be used to make surrogate time series which have the same Fourier spectrum as the rescaled data. Finally, the Gaussian surrogate is rescaled back to have the same amplitude distribution as the original time series.
After that, the statistic of interest is calculated separately for the original data and for the surrogate data. Theiler noted that there is a great deal of flexibility in the choice of statistic. The test below is used to compare the original data with the surrogate data.
Let $Q_{\mathrm{orig}}$ denote the statistic computed for the original time series, and $Q_{\mathrm{surr},i}$ the statistic for the $i$th surrogate series generated under the null hypothesis. Let $\mu_{\mathrm{surr}}$ and $\sigma_{\mathrm{surr}}$ denote the mean and standard deviation of the distribution of $Q_{\mathrm{surr},i}$.
We define the significance as:
$$S = \frac{|Q_{\mathrm{orig}} - \mu_{\mathrm{surr}}|}{\sigma_{\mathrm{surr}}} \tag{2.8}$$
If the distribution of the statistic is Gaussian (and numerical experiments indicate that this is often a reasonable approximation), then the p-value is given by $p = \mathrm{erfc}(S/\sqrt{2})$.
The Kolmogorov-Smirnov or Mann-Whitney test is often used to compare the full distributions of the observed data and the surrogate data directly, whereas the Student t-test compares only their means. For the present purposes, we use a form of t-test.
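As a small worked sketch of (2.8) and the Gaussian p-value, the following function takes the statistic of the original series and a list of surrogate statistics. The function name and the use of the sample standard deviation (n−1 denominator) are assumptions; the text does not say which estimator is used.

```python
import math
import statistics


def surrogate_significance(q_orig, q_surr):
    """Return (S, p) for an original statistic against surrogate statistics."""
    mu_surr = statistics.mean(q_surr)
    sigma_surr = statistics.stdev(q_surr)  # sample SD (assumption)
    s = abs(q_orig - mu_surr) / sigma_surr
    # Two-sided p-value under the Gaussian approximation.
    p = math.erfc(s / math.sqrt(2.0))
    return s, p


# Illustrative numbers only, not values from the study.
s, p = surrogate_significance(2.0, [2.9, 3.1, 3.0, 2.8, 3.2])
print(s, p)
```

When the original statistic coincides with the surrogate mean, S is 0 and p is 1; larger S gives smaller p.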
We studied 20 healthy subjects and 14 AD patients who were in relaxed and eye-closed state. Original EEG data covered 10 electrodes: c3, c4, f3, f4, o1, o2, p3, p4, t3, t4.
As a preliminary step, we calculated the SampEn of the original EEG data of the 34 subjects at each electrode. We then calculated the SampEn of surrogate data at the c3, o2, o1, and f3 electrodes. The choice of these four electrodes was based on the results of the preliminary step and on previous studies [20]. Each time series in the study contained approximately 32,720 samples. To evaluate the influence of the entropic index of SampEn, we calculated the difference between the SampEn of the original time series and the average SampEn of its surrogate data. First, for each original time series, 300 surrogate series were generated by each of the three algorithms described above, that is, 900 surrogate series in total. SampEn was calculated for each surrogate series, and the mean SampEn (qsurr) of the 300 surrogate series from each algorithm was computed. SampEn was also calculated for the original time series (qorig). qSD was defined as qSD=|qsurr−qorig|.
Finally, a two-sample t-test assuming unequal variances (heteroscedasticity) was used to test the significance of the difference between healthy subjects and AD patients. The test applies to two samples drawn from different populations whose variances are unknown and not assumed equal, and it tests whether the means of the two samples differ significantly. If the two-tailed p-value is greater than 0.01, the null hypothesis is not rejected, which means that there is no significant difference between the means of the two samples.
We calculated the mean and variance of SampEn for the 34 subjects (20 healthy subjects and 14 AD patients) at each electrode. The results are shown in Table 1, from which we can infer that the mean SampEn of healthy subjects was larger than that of AD patients at every electrode. However, details of the datasets may be lost when the data are averaged. We therefore plotted the SampEn of the 20 healthy subjects (left) and the 14 AD patients (right) in Figure 2. Each decagon represents one person, and each vertex of the decagon represents the value of SampEn at one electrode. For most of the 14 AD patients, SampEn was less than 3.00, whereas for some healthy subjects it was larger than 3.00. As mentioned above, increases of SampEn are often associated with increases of complexity, so this result is consistent with a loss of complexity in AD. However, there was a partial overlap between the two panels, in which the SampEn of healthy subjects was only slightly larger than that of AD patients. The main reason was probably individual differences. SampEn of healthy subjects was clearly larger than that of AD patients at the t3 electrode, which is close to the brain areas involved in memory functions.
Table 1. Mean and variance of SampEn at each electrode for healthy subjects and AD patients, with t-test p-values.
Group | Statistic | c3 | c4 | f3 | f4 | o1 | o2 | p3 | p4 | t3 | t4 |
healthy subjects | mean | 2.7464 | 2.7556 | 2.7181 | 2.7321 | 2.9439 | 3.0126 | 2.7928 | 2.8819 | 2.9060 | 2.8513 |
healthy subjects | var | 0.0715 | 0.0485 | 0.0361 | 0.0399 | 0.0186 | 0.0674 | 0.0113 | 0.0117 | 0.3242 | 0.0742 |
AD patients | mean | 2.5493 | 2.5961 | 2.5874 | 2.6022 | 2.8268 | 2.7734 | 2.6903 | 2.6959 | 2.5652 | 2.6797 |
AD patients | var | 0.0131 | 0.0304 | 0.0056 | 0.0080 | 0.0160 | 0.0286 | 0.0481 | 0.0468 | 0.0356 | 0.0529 |
p-value | | 0.0067 | 0.0252 | 0.0098 | 0.0160 | 0.0156 | 0.0027 | 0.1235 | 0.0082 | 0.0198 | 0.0563 |
A t-test assuming unequal variances was performed to test the significance of the difference between healthy subjects and AD patients. SampEn of healthy subjects differed from that of AD patients at electrodes c3, c4, f3, f4, o1, o2, p4, and t3 (p<0.05), and the difference was highly significant (p<0.01) at electrodes c3, f3, o2, and p4.
However, the accuracy of the original data is unknown. Furthermore, repeating the experiments is time consuming and would introduce exogenous variables. Statistical tests in time series analysis require a sufficient number of samples. Such samples can be obtained by the surrogate data method, which constructs new series directly from the original time series and therefore saves time. Surrogate data must be randomized while retaining selected characteristics of the original data (such as the amplitude distribution or the autocorrelation function).
It has been confirmed that EEG data, as a chaotic time series, has a number of characteristics of nonlinear dynamics. Therefore, chaotic time series analysis methods can be applied to EEG signals. Surrogate data analysis, an indirect method, can not only analyze chaotic time series but also deepen our understanding of them. Moreover, there is always room for human judgment with real data. Theiler argued that, besides formally rejecting a null hypothesis, the method of surrogate data can also be used in an informal way to provide a benchmark or control experiment against which the actual data can be compared [29].
For each set of original data at electrodes c3, o2, o1, and f3, we generated 900 sets of surrogate data using three algorithms (shuffling, FT, and AAFT: 300 sets each). Among these four electrodes, c3, f3, and o2 were selected because of the results for the original data, and o1 was selected as a contrast. The mean and standard deviation (SD) of the 300 SampEn values (one per surrogate series generated by a given algorithm) were then calculated. We selected an AD patient at the o2 electrode and drew a frequency histogram of the 300 SampEn values for each of the three algorithms (Figure 3). The origin of the x-axis is the SampEn of the original data. The curve on the left is the distribution of SampEn for the 300 sets of AAFT surrogate data, the middle one is for the FT surrogate data, and the right one is for the shuffled series. The curves are well separated with no overlap, and SampEn is largest for shuffling. The surrogate data had a higher SampEn than the original time series. Values of p<0.01 were considered to indicate a highly significant difference. The null hypothesis was therefore rejected, which means that EEG signals are nonlinear time series [37].
Table 2 shows the SampEn results at electrodes c3, o2, o1, and f3. In agreement with Figure 3, the SampEn of the surrogate data generated by all three algorithms is larger than that of the original data, and it is largest for shuffling. This result can be related to the properties of the surrogate algorithms. As Table 3 shows, different surrogate data retain different characteristics of the original data. Surrogate data generated by shuffling the time order of the original series are guaranteed to have the same amplitude distribution as the original data, but any temporal correlations that may exist in the data are destroyed. The FT surrogate data are constructed to have the same Fourier spectrum and autocorrelation function as the original data while the phases of the Fourier transform are randomized, and the first-order characteristics (mean, SD, etc.) are preserved [38]. The AAFT algorithm, an improvement on FT, provides a surrogate of the original time series which retains its amplitude distribution, first-order characteristics, and autocorrelation function [39]. The surrogate data generated by the AAFT algorithm retain most characteristics of the original data, so their SampEn is closest to that of the original data. For the surrogate data, there is no highly significant difference (p>0.01) in SampEn between healthy subjects and AD patients, probably because the characteristics of the original data that distinguish the two groups are not retained. To address this problem, we defined qSD=|qsurr−qorig| (qsurr is the mean of the 300 SampEn values; qorig is the SampEn of the original data).
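The unequal-variance (Welch) t-test can be reproduced from the summary statistics in Table 1. The sketch below is an assumption-laden illustration: the function name is hypothetical, the variances are treated as sample variances, and the two-sided p-value is obtained by integrating the Student t density with Simpson's rule so that only the standard library is needed (a statistics package would normally be used instead).

```python
import math


def welch_t_test(mean1, var1, n1, mean2, var2, n2, steps=2000):
    """Two-sided Welch t-test from summary statistics.

    Returns (t, df, p). `steps` must be even (Simpson's rule).
    """
    se2_1, se2_2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se2_1 + se2_2)
    # Welch-Satterthwaite degrees of freedom (generally not an integer).
    df = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1))

    # Log of the normalising constant of the Student t density.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    # Integrate the density from 0 to |t|; the two-sided p-value is
    # the probability mass outside [-|t|, |t|].
    upper = abs(t)
    h = upper / steps
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    p = max(0.0, 1.0 - 2.0 * area)
    return t, df, p


# c3 column of Table 1: 20 healthy subjects vs 14 AD patients.
t, df, p = welch_t_test(2.7464, 0.0715, 20, 2.5493, 0.0131, 14)
print(round(t, 3), round(df, 1), round(p, 4))
```

For the c3 column this gives t of about 2.93 with roughly 27.5 degrees of freedom, and the resulting p-value should be close to the 0.0067 reported in Table 1, which suggests that the tabulated variances are sample variances and that the test is indeed the Welch form.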
Table 2. SampEn of the original data and of the surrogate data at electrodes c3, o2, f3, and o1.
electrode | data | healthy subjects | AD patients | p-value |
c3 | original | 2.7464±0.0715 | 2.5493±0.0131 | 0.007 |
c3 | shuffling | 5.4949±0.1575 | 5.6821±0.1068 | 0.226 |
c3 | FT | 3.8067±0.0973 | 3.7685±0.1031 | 0.732 |
c3 | AAFT | 2.8655±0.0904 | 2.8044±0.0787 | 0.549 |
o2 | original | 3.0126±0.0674 | 2.7734±0.0286 | 0.003 |
o2 | shuffling | 5.4126±0.1317 | 5.5975±0.2119 | 0.306 |
o2 | FT | 4.0612±0.0669 | 3.8688±0.0660 | 0.046 |
o2 | AAFT | 3.1240±0.0641 | 2.9242±0.0526 | 0.023 |
f3 | original | 2.7181±0.0361 | 2.5874±0.0056 | 0.009 |
f3 | shuffling | 5.6923±0.0770 | 5.7800±0.0609 | 0.419 |
f3 | FT | 3.8337±0.0673 | 3.7833±0.0699 | 0.586 |
f3 | AAFT | 2.8774±0.0559 | 2.8121±0.0479 | 0.414 |
o1 | original | 2.9439±0.0186 | 2.8268±0.0160 | 0.016 |
o1 | shuffling | 5.4751±0.1949 | 5.3209±0.0895 | 0.330 |
o1 | FT | 4.0046±0.0499 | 3.9206±0.0826 | 0.369 |
o1 | AAFT | 3.0381±0.0244 | 2.9379±0.0508 | 0.164 |
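Table 3. Characteristics of the original data retained by each surrogate algorithm.
characteristic | shuffling | FT | AAFT |
amplitude distribution | √ | | √ |
the first-order characteristics | | √ | √ |
Fourier spectrum | | √ | √ |
autocorrelation function | | √ | √ |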
Surrogate data were used here to compute entropy differences between the original dynamics and the surrogate series. The use of surrogate data series makes it possible to differentiate low-dimensional deterministic chaos from stochastic processes [27]. SampEn, as the discriminating statistic, indicated the difference between healthy subjects and AD patients. We used the three algorithms to calculate qSD.
Table 4 shows the results for qSD. The significance of the group differences was tested with a t-test. Comparing healthy subjects with AD patients for the shuffling algorithm gave p=0.023, p=0.032, p=0.763, and p=0.072 at the c3, o2, f3, and o1 electrodes, respectively. Only at the c3 and o2 electrodes was p<0.05, which indicates a significant difference between AD patients and healthy subjects at these two electrodes.
Table 4. qSD for the three surrogate algorithms at electrodes c3, o2, f3, and o1.
algorithm | Group | c3 | o2 | f3 | o1 |
AAFT | healthy subjects | 0.1245±0.0154 | 0.1148±0.0069 | 0.1665±0.0219 | 0.0963±0.0031 |
AAFT | AD patients | 0.2552±0.0464 | 0.1526±0.0227 | 0.2247±0.0357 | 0.1111±0.0175 |
AAFT | p-value | 0.055 | 0.403 | 0.345 | 0.687 |
FT | healthy subjects | 1.0604±0.0199 | 1.0486±0.0128 | 1.1156±0.0268 | 1.0606±0.0226 |
FT | AD patients | 1.2193±0.0688 | 1.1509±0.0766 | 1.1960±0.0580 | 1.0937±0.0508 |
FT | p-value | 0.053 | 0.208 | 0.290 | 0.636 |
shuffling | healthy subjects | 2.7797±0.1929 | 2.3945±0.2570 | 2.9510±0.1129 | 2.5810±0.1965 |
shuffling | AD patients | 3.1755±0.1149 | 2.8545±0.2152 | 3.2052±0.0755 | 2.5364±0.0658 |
shuffling | p-value | 0.023 | 0.032 | 0.763 | 0.072 |
The probable reason why there was no significant difference between healthy subjects and AD patients for the FT and AAFT algorithms is that the features shared by the surrogate data and the original data are eliminated by the subtraction, and the remaining differences may be weakened. That is, the more characteristics of the original data the FT and AAFT surrogates preserve, the less information remains in qSD. In contrast, surrogate data generated by shuffling are guaranteed only to have the same amplitude distribution as the original data, so the subtraction has less impact on the statistical test. In other words, the shuffling algorithm detects the group difference better here.
We selected six healthy subjects and six AD patients at the o2 electrode and looked for differences between the two groups in the frequency histograms of the 300 SampEn values. Figure 4 shows the distribution of the 300 SampEn values for surrogate data generated by the AAFT algorithm. Although the shape of the data can be seen immediately from a frequency histogram, a histogram is not a good way to judge whether the data come from a specific distribution. Normal probability plots are widely used as a statistical tool for assessing whether an observed simple random sample is drawn from a normally distributed population [40]. Figure 5, corresponding to Figure 4, shows the normal probability plots, which compare the distribution of the data with the normal distribution. Each plot includes a reference line, which is useful for judging whether the data follow a normal distribution. Each small graph represents the SampEn distribution of one person. One-sixth of the healthy subjects had a positively skewed distribution (a u-shaped probability plot), whereas half of the AD patients did. This suggests that positively skewed SampEn distributions are more common in AD patients. The probability plot thus gives another perspective from which to visualize the distribution of the data.
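To make the probability-plot idea concrete, the sketch below computes the points of a normal probability plot and a moment-based skewness for a list of values. It is illustrative only: the function names, the plotting positions (i + 0.5)/n, and the skewness estimator are assumptions, and the sample values are invented rather than taken from the study.

```python
import statistics


def normal_probability_points(values):
    """Return (theoretical normal quantile, ordered value) pairs."""
    n = len(values)
    std_normal = statistics.NormalDist()
    ordered = sorted(values)
    # Plotting positions (i + 0.5) / n are one common convention (assumption).
    return [(std_normal.inv_cdf((i + 0.5) / n), v)
            for i, v in enumerate(ordered)]


def sample_skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented SampEn-like values with a long right tail.
values = [2.90, 2.91, 2.92, 2.92, 2.93, 2.95, 2.99, 3.08]
points = normal_probability_points(values)
print(sample_skewness(values))
```

If the data were normally distributed, the points would fall close to a straight line; a positively skewed sample bends upward at the right-hand end, and its skewness is greater than zero.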
A relevant study using the same database revealed a significant reduction in complexity in AD, as measured with the mean ApEn, at electrodes c3 and o2 [20]. Other previous studies using ApEn [41], SampEn [42], and Fuzzy entropy analyzed different databases. Although ApEn and SampEn were found to be significantly lower in AD patients than in healthy subjects at electrodes p3, p4, o1, and o2, the classification accuracy obtained with receiver operating characteristic (ROC) curves at those electrodes differed between the two measures [22,25,43]. SampEn showed superior discriminating power compared with ApEn, which could arise from the fact that SampEn is an improvement on ApEn. Moreover, ApEn results should be interpreted with great care, as ApEn is a biased entropy estimator and not as reliable as other algorithms [25]. These results are also supported by recent findings with Fuzzy entropy [25]. All of these results indicate that the EEG activity of AD patients is significantly more regular (less complex) than that of a normal brain in the parietal and occipital regions. Our study showed that the c3 electrode also exhibited less complex activity, which indicates that the central region may also be affected.
Many studies have demonstrated alterations of EEG complexity, synchrony, and brain dynamics in AD, and many different features of EEG series have been extracted for AD detection [7,8]. Entropy is a key measure of time series [22]. We proposed a method which combines SampEn with surrogate data to analyze the differences between healthy subjects and AD patients. The value of SampEn is often associated with complexity; AD can cause a loss of complexity, which gives rise to smaller values of SampEn. The surrogate data method was used here as a control experiment against which the actual data can be compared.
We examined the SampEn at each electrode for 20 healthy subjects and 14 AD patients. The results showed that SampEn differed (p<0.01) between AD patients and healthy subjects at electrodes c3, f3, o2, and p4. We then introduced three surrogate algorithms to calculate qSD=|qsurr−qorig| (qsurr is the mean of the 300 SampEn values; qorig is the SampEn of the original data) for four electrodes (c3, f3, o2, and o1), and performed a t-test assuming unequal variances for each electrode. The results showed a significant difference between healthy subjects and AD patients at the c3 and o2 electrodes for the shuffling algorithm. This approach is used for the first time to analyze the differences between healthy subjects and AD patients, and it offers a different perspective. Other studies using the same database found a significant reduction in complexity at the c3 and o2 electrodes, and their results are consistent with ours [20]. Our results also showed that EEG signals are nonlinear time series. This indicates that our method is feasible.
However, we did not find the complexity loss at the p3 and p4 electrodes. There are several possible reasons for this. The surrogate data had a higher SampEn than the original time series. Values of p<0.01 were considered significant, so the null hypothesis can be rejected, which means that EEG signals are nonlinear time series. As stated above, this is a disadvantage of SampEn, because an uncorrelated version of a signal cannot be more complex than the original one [27]. Nevertheless, the values of SampEn still carry useful information. Improved methods such as modified generalized multiscale sample entropy [30,31] and generalized sample entropy [27,32] can be used to analyze EEG signals.
We adopted the surrogate data method to obtain enough samples for statistical tests in time series analysis and, in a sense, to reproduce the experiments. Because different surrogate algorithms retain different characteristics, different information can be extracted from the original data for different purposes. EEG signals are nonlinear, and SampEn combined with surrogate data can identify this nonlinear feature effectively. Our method is capable of distinguishing AD patients from healthy subjects and can provide insights for the understanding of AD. We have no further information about the patients (such as age and gender), so the differences between AD patients and healthy subjects cannot be analyzed in more detail. We will continue to investigate this method in the future using more datasets with detailed information.
This work is supported in part by the National Basic Research Program of China under Grant 2013CB329501, in part by the National Natural Science Foundation of China under Grant 81271645, in part by the Public Projects of Science Technology Department of Zhejiang Province under Grant 2013C33162, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LY 12H18004, and in part by the Science and Technology Commission of Shanghai Municipality (Grant No. 18411970300), and the Health Industry Clinical Research of the Shanghai Health and Family Planning Committee (Grant No. 201840018).
The authors declare that they have no competing interests.
[1] | Levitan A. V., Redman T. C. (1998) Data as a resource: Properties, implications and prescriptions. Sloan Management Review 40: 89-101. |
[2] | A. Koronios, S. Lin and J. Gao, A data quality model for asset management in engineering organisations, Proceedings of the 10th International Conference on Information Quality, Massachusetts Institute of Technology, Cambridge, USA, 2005. |
[3] | G. Gilliland, S. K. Barger, V. Bhatia and R. Nicol, Creating value through data integrity: A pragmatic approach, BCG Perspectives, (2011), Available at http://www.bcgindia.com/documents/file83320.pdf. |
[4] | Rao J. S., Zubair M., Rao C. (2003) Condition monitoring of power plants through the Internet. Integrated Manufacturing Systems 14: 508-517. |
[5] | O. Prakash, Asset management through condition monitoring - How it may go wrong: A case study, Proceedings of the 1st World Congress on Engineering Asset Management (WCEAM) 2006, Gold Coast, Queensland, Australia, July 11–14, 2006. |
[6] | Kalogirou S. A. (2003) Artificial intelligence for the modeling and control of combustion processes: A review. Progress in Energy and Combustion Science 29: 515-566. |
[7] | Liao S. H. (2005) Expert system methodologies and applications - A decade review from 1995 to 2004. Expert Systems with Applications 28: 93-103. |
[8] | K. Warwick, A. O. Ekwue and R. Aggarwal, Artificial Intelligence Techniques in Power Systems, Institution of Electrical Engineers, Stevenage, UK, 1997. doi: 10.1049/PBPO022E |
[9] | K. Wang, Intelligent Condition Monitoring and Diagnosis System a Computational Intelligent Approach, Frontiers in Artificial Intelligence and Applications, 93, 2003. |
[10] | Rao M., Yang H., Yang H. (1996) Integrated distributed intelligent system for incident reporting in DMI pulp mill, success and failures of knowledge-based systems in real-world applications. Proceedings of the First International Conference : 169-178. |
[11] | Rao M., Zhou J., Yang H. (1998) Architecture of integrated distributed intelligent multimedia system for on-line real-time process monitoring. SMC'98 Conference Proceedings, 1998 IEEE International Conference on Systems, Man, and Cybernetics 2: 1411-16. |
[12] | Rao M., Zhou J., Yang H. (1998) Integrated distributed intelligent system architecture for incidents monitoring and diagnosis. Computers in Industry 37: 143-151. |
[13] | Reichard K. M., Van Dyke M., Maynard K. (2000) Application of sensor fusion and signal classification techniques in a distributed machinery condition monitoring system. Proceedings of SPIE - The International Society for Optical Engineering 4051: 329-336. |
[14] | J. Campos and O. Prakash, Information and communication technologies in condition monitoring and maintenance, in Dolgui, A., Morel, G. and Pereira, C. E. (Eds.) Information Control Problems in Manufacturing, Elsevier, 39 (2006), 3-8. doi: 10.3182/20060517-3-FR-2903.00003 |
[15] | Campos J. (2009) Survey paper: Development in the application of ICT in condition monitoring and maintenance. Computers in Industry 60: 1-20. |
[16] | Sycara K. P. (1998) MultiAgent Systems. AI Magazine 19. |
[17] | Feng J. Q., Buse D. P., Wu Q. H., Fitch J. (2002) A multi-agent based intelligent monitoring system for power transformers in distributed substations. International Conference on Power System Technology Proceedings 3: 1962-1965. |
[18] | Weaver A. C. (1997) The internet and the world wide web. 23rd International Conference on Industrial Electronics, Control and Instrumentation 4: 1529-1540. |
[19] | D. Stenmark, Designing the new intranet, Gothenburg Studies in Informatics, Report 21, March 2002. |
[20] | M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. S. Netto and R. Buyya, Big data computing and clouds: Trends and future directions, Journal of Parallel and Distributed Computing, Special Issue on Scalable Systems for Big Data Management and Analytics, (2015), 79-80, 3-15. doi: 10.1016/j.jpdc.2014.08.003 |
[21] | B. Xu and S. Kumar, Big Data Analytics Framework For System Health Monitoring, Presented at the 2015 IEEE International Congress on Big Data (BigData Congress), IEEE Computer Society, 2015. doi: 10.1109/BigDataCongress.2015.66 |
[22] | Fumeo E., Oneto L., Anguita D. (2015) Condition based maintenance in railway transportation systems based on big data streaming analysis. Procedia Computer Science 53: 437-446. |
[23] | A. Mohamed, M. S. Hamdi and S. Tahar, A machine learning approach for big data in oil and gas pipelines, Presented at the 2015 International Conference on Future Internet of Things and Cloud (FiCloud), 2015 International Conference on Open and Big Data (OBD), IEEE Computer Society, 2015. doi: 10.1109/FiCloud.2015.54 |
[24] | A. Nunez, J. Hendriks, L. Zili, B. De Schutter and R. Dollevoet, Facilitating maintenance decisions on the Dutch railways using big data: The ABA case study, Presented at the 2014 IEEE International Conference on Big Data (Big Data), IEEE, 2014. doi: 10.1109/BigData.2014.7004431 |
[25] | A. Parida and U. Kumar, Managing information is key to maintenance effectiveness, in Proceedings of Intelligent Maintenance System, Arles, France, 15-17 July, 2004. |
[26] | P. Soderholm, Continuous Improvement of Complex Technical System: Aspects of Stakeholder Requirements and System Functions, Licentiate Thesis, Division of Quality and Environmental Management, Lulea University of Technology, Lulea, 2003. |
[27] | Chen G., Wua S., Wang Y. (2015) The evolvement of big data systems: From the perspective of an information security application. Big Data Research 2: 65-73. |
[28] | Fan W., Bifet A. (2012) Mining big data: Current status, and forecast to the future. SIGKDD Explorations 14: 1-5. |
[29] | N. Taleb, Antifragile: How to Live in a World We Don't Understand, Penguin Books Limited, 2012. |
[30] | A. Parida, Role of condition monitoring and performance measurements in asset productivity enhancement, 20th International Conference on Condition Monitoring and Diagnostic Engineering Management, Faro, Portugal, 2007. |
[31] | Jagadish H. V., Gehrke J., Labrinidis A., Papakonstantinou Y., Patel J. M., Ramakrishnan R., Shahabi C. (2014) Big data and its technical challenges. Communications of the ACM 57: 86-94. |
[32] | Kumar U., Galar D., Parida A., Stenström C., Berges L. (2013) Maintenance performance metrics: A state-of-the-art review. Journal of Quality in Maintenance Engineering 19: 233-277. |
[33] | Orlikowski W. J., Barley S. R. (2001) Technology and institutions: What can research on information technology and research on organizations learn from each other? MIS Quarterly 25: 145-165. |
[34] | S. Rogers, Big data is scaling bi and analytics, Available at http://www.informationmanagement.com/issues/21-5/big-data-is-scaling-bi-and-analytics-10021093-1.html, 2011. |
[35] | Fan J., Han F., Liu H. (2014) Challenges of big data analysis. National Science Review 1: 293-314. |
[36] | Wu X., Zhu X., Wu G.-Q., Ding W. (2014) Data mining with big data. IEEE Transactions on Knowledge and Data Engineering 26: 97-107. |
[37] | Hsieh J. C., Li A. H., Yang C. C. (2013) Mobile, cloud, and big data computing: Contributions, challenges, and new directions in telecardiology. International Journal of Environmental Research and Public Health 10: 6131-6153. |
[38] | ISACA, Generating Value From Big Data Analytics, White Paper, Retrieved from (http://www.isaca.org), 2014. |
[39] | Bollen J., Mao H., Zeng X. (2011) Twitter mood predicts the stock market. Journal of Computational Science 2: 1-8. |
[40] | Gandomi A., Haider M. (2015) Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management 35: 137-144. |
[41] | Oborski P. (2004) Man-machine interactions in advanced manufacturing systems. The International Journal of Advanced Manufacturing Technology 23: 227-232. |
[42] | J. Horák, I. Ivan, T. Inspektor and J. Tesla, Sparse big data problem: A case study of Czech graffiti crimes, In: Ivan I., Singleton A., Horák J., Inspektor T. (eds) The Rise of Big Spatial Data, Lecture Notes in Geoinformation and Cartography, Springer, 2017. |
[43] | Yan Y., Chen L. J., Zhang Z. (2014) Error bounded sampling for analytics on big sparse data. Proceedings of the VLDB Endowment 7: 1508-1519. |
[44] | Kumar P. K., Rao P. C., Changala R., Rao T. J., Shankar P. H. (2015) Data mining challenges with big data. International Journal for Research in Applied Science & Engineering Technology (IJRASET) 3: 148-150. |
[45] | Longadge R., Dongre S. S., Malik L. (2013) Class imbalance problem in data mining: Review. International Journal of Computer Science and Network (IJCSN) 2. |
[46] | He H., Garcia E. A. (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21: 1263-1284. |
[47] | Sun Y., Wong A. K. C., Kamel M. S. (2009) Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence 23: 687-719. |
[48] | Weiss G. M. (2004) Mining with rarity: A unifying framework. SIGKDD Explorations 6: 7-19. |
[49] | Widmer G., Kubat M. (1996) Learning in the presence of concept drift and hidden contexts. Machine Learning 23: 69-101. |
[50] | Zhang P., Zhu X., Shi Y. (2008) Categorizing and mining concept drifting data streams. The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 812-820. |
[51] | Chandak M. B. (2016) Role of big data in classification and novel class detection in data streams. Journal of Big Data 3: 1-9. |
[52] | Zliobaite I., Pechenizkiy M., Gama J. (2015) An overview of concept drift applications. Big Data Analysis: New Algorithms for a New Society 16: 91-114. |

Group | statistic | c3 | c4 | f3 | f4 | o1 | o2 | p3 | p4 | t3 | t4 |
healthy subjects | mean | 2.7464 | 2.7556 | 2.7181 | 2.7321 | 2.9439 | 3.0126 | 2.7928 | 2.8819 | 2.9060 | 2.8513 |
healthy subjects | var | 0.0715 | 0.0485 | 0.0361 | 0.0399 | 0.0186 | 0.0674 | 0.0113 | 0.0117 | 0.3242 | 0.0742 |
AD patients | mean | 2.5493 | 2.5961 | 2.5874 | 2.6022 | 2.8268 | 2.7734 | 2.6903 | 2.6959 | 2.5652 | 2.6797 |
AD patients | var | 0.0131 | 0.0304 | 0.0056 | 0.0080 | 0.0160 | 0.0286 | 0.0481 | 0.0468 | 0.0356 | 0.0529 |
p-value | | 0.0067 | 0.0252 | 0.0098 | 0.0160 | 0.0156 | 0.0027 | 0.1235 | 0.0082 | 0.0198 | 0.0563 |

electrode | data | healthy subjects | AD patients | p-value |
c3 | original | 2.7464±0.0715 | 2.5493±0.0131 | 0.007 |
c3 | shuffling | 5.4949±0.1575 | 5.6821±0.1068 | 0.226 |
c3 | FT | 3.8067±0.0973 | 3.7685±0.1031 | 0.732 |
c3 | AAFT | 2.8655±0.0904 | 2.8044±0.0787 | 0.549 |
o2 | original | 3.0126±0.0674 | 2.7734±0.0286 | 0.003 |
o2 | shuffling | 5.4126±0.1317 | 5.5975±0.2119 | 0.306 |
o2 | FT | 4.0612±0.0669 | 3.8688±0.0660 | 0.046 |
o2 | AAFT | 3.1240±0.0641 | 2.9242±0.0526 | 0.023 |
f3 | original | 2.7181±0.0361 | 2.5874±0.0056 | 0.009 |
f3 | shuffling | 5.6923±0.0770 | 5.7800±0.0609 | 0.419 |
f3 | FT | 3.8337±0.0673 | 3.7833±0.0699 | 0.586 |
f3 | AAFT | 2.8774±0.0559 | 2.8121±0.0479 | 0.414 |
o1 | original | 2.9439±0.0186 | 2.8268±0.0160 | 0.016 |
o1 | shuffling | 5.4751±0.1949 | 5.3209±0.0895 | 0.330 |
o1 | FT | 4.0046±0.0499 | 3.9206±0.0826 | 0.369 |
o1 | AAFT | 3.0381±0.0244 | 2.9379±0.0508 | 0.164 |

shuffling | FT | AAFT | |
amplitude distribution | √ | √ | |
the first-order characteristics | √ | √ | |
Fourier spectrum | √ | √ | |
autocorrelation function | √ | √ |

algorithm | group | c3 | o2 | f3 | o1 |
AAFT | healthy subjects | 0.1245±0.0154 | 0.1148±0.0069 | 0.1665±0.0219 | 0.0963±0.0031 |
AAFT | AD patients | 0.2552±0.0464 | 0.1526±0.0227 | 0.2247±0.0357 | 0.1111±0.0175 |
AAFT | p-value | 0.055 | 0.403 | 0.345 | 0.687 |
FT | healthy subjects | 1.0604±0.0199 | 1.0486±0.0128 | 1.1156±0.0268 | 1.0606±0.0226 |
FT | AD patients | 1.2193±0.0688 | 1.1509±0.0766 | 1.1960±0.0580 | 1.0937±0.0508 |
FT | p-value | 0.053 | 0.208 | 0.290 | 0.636 |
shuffling | healthy subjects | 2.7797±0.1929 | 2.3945±0.2570 | 2.9510±0.1129 | 2.5810±0.1965 |
shuffling | AD patients | 3.1755±0.1149 | 2.8545±0.2152 | 3.2052±0.0755 | 2.5364±0.0658 |
shuffling | p-value | 0.023 | 0.032 | 0.763 | 0.072 |