
Selective further learning of hybrid ensemble for class imbalanced increment learning

  • Published: 01 January 2017
  • Primary: 58F15, 58F17; Secondary: 53C35

  • Incremental learning has been investigated by many researchers. However, few works have considered the situation where class imbalance occurs. In this paper, class imbalanced incremental learning is investigated and an ensemble-based method, named Selective Further Learning (SFL), is proposed. In SFL, a hybrid ensemble of Naive Bayes (NB) and Multilayer Perceptrons (MLPs) is employed. For the ensemble of MLPs, part of the MLPs are selected to learn from the new data set. Negative Correlation Learning (NCL) with Dynamic Sampling (DyS) for handling class imbalance is used as the basic training method. Besides, as an additive model, Naive Bayes is employed as an individual of the ensemble to learn the data sets incrementally. A group of weights (whose length is the number of classes) is updated for every individual of the ensemble to indicate the 'confidence' of the individual on each class. The ensemble combines all of the individuals by a weighted average according to these weights. Experiments on 3 synthetic data sets and 10 real-world data sets show that SFL is able to handle class imbalanced incremental learning and outperforms a recently proposed related approach.

    Citation: Minlong Lin, Ke Tang. 2017: Selective further learning of hybrid ensemble for class imbalanced increment learning, Big Data and Information Analytics, 2(1): 1-21. doi: 10.3934/bdia.2017005




    1. Introduction

    Normal machine learning problems require the learning model to learn from all the data obtained so far, and all the data are stored. In practice, however, the data are usually updated over time and new information has to be learned from the new data [19]. Re-learning with access to all previous data is time consuming, and storing all previously learned data is also expensive. In this situation, the learning model is required to learn new information from new data while preserving the previously learned information without accessing the previous data. This learning paradigm is called incremental learning [8], [21].

    In incremental learning, the whole data set is not available in one lump; we can only access a part of it at a time. Suppose the whole data set $S$ is divided into $T$ subsets $S_1, S_2, \ldots, S_T$. The rules (e.g., classification boundaries in classification problems) of $S$ and $S_t$ are denoted as $R$ and $R_t$, respectively. The aim of the learning model is to learn $R$ by learning each $R_t$ from $S_t$ in turn. The main difficulty is that previously learned rules may be forgotten when the model learns new rules from new data subsets, especially when the rules of different data subsets differ. This phenomenon is called catastrophic forgetting. If $R_1=R_2=\ldots=R_T$, the learning model can learn $R_1$ from $S_1$ and $R_1$ will not be forgotten when new data subsets are learned; in this case, incremental learning poses no real challenge. In practice, however, the $R_t$ usually differ between data subsets, so catastrophic forgetting may happen.

    In our setting, even though the rules differ between data subsets, the target rule (i.e., $R$) does not change. This phenomenon is called virtual concept drift [28]; it differs from real concept drift, in which the target concept changes as new data subsets become available. Virtual concept drift was called sampling shift in [22], and that term will be used in this paper. Some additive models can easily be adapted to learn incrementally when sampling shift occurs. For example, under Bayes decision theory, the rules can be represented by parameters, and the parameters for the whole data set can be combined from those of the individual subsets. In this way, such models can learn the data subsets separately and still form the same learner that would be obtained from the whole data set. However, these methods often require assumptions about the data distribution, and their decision boundaries are usually simple. Neural networks have a strong ability to learn complex classification boundaries. Unfortunately, they are not additive: after training on new data subsets, the model tends to perform well on the new subsets but poorly on the previous ones [8]. In other words, the model forgets the previously learned rules. It is therefore a challenge to employ neural networks for incremental learning in this situation.

    To exploit neural networks for incremental learning, several ensemble-based approaches have been proposed. In our previous work, Selective Negative Correlation Learning (SNCL) [26], a selective ensemble method was employed to prevent the model from forgetting previously learned information. Other ensemble-based methods for incremental learning include Fixed Size Negative Correlation Learning (FSNCL), Growing Negative Correlation Learning (GNCL) [15], and the Learn++ methods [21], [17], [5]. In SNCL and FSNCL (size-fixed methods), the model is able to learn new information from new data subsets while the size of the model stays fixed. However, their ability to preserve previously learned information is not as good as that of the Learn++ methods and GNCL (size-grown methods), in which the model grows larger as new data subsets are learned. Since new data subsets keep arriving in practice, the models of Learn++ and GNCL will eventually become too large. It is therefore worthwhile to design a method with the benefits of both size-fixed and size-grown methods.

    Besides sampling shift, there is another issue in incremental learning: class imbalance. For normal (non-incremental) learning, the class imbalance problem has been studied by many researchers and there is a large literature addressing it [11], [4], [9]. Class imbalance may also occur in incremental learning, and this issue has been investigated as well [5], [6]. There are mainly two cases of class imbalanced incremental learning:

    (1) If the class distribution of the whole data set $S$ is imbalanced, the class distribution of each data subset $S_t$ will usually be imbalanced too. Moreover, samples of the minority classes will commonly be missing from some data subsets.

    (2) Even if the class distribution of $S$ is balanced, $S_t$ may still be class imbalanced. In a typical case, every partial set $S_t$ is class imbalanced while the combined data set $S$ is class balanced.

    In this paper, we focus on class imbalanced cases in which sampling shift also occurs. Specifically, when sampling shift occurs, new classes may appear in a new data subset and some previous classes may be absent from it. When the class distribution of the whole data set is imbalanced, this is especially likely to happen to the minority classes. This is the main issue addressed in this paper.

    The rest of the paper is organized as follows. In Section Ⅱ, we briefly review some existing methods for incremental learning. Our method, Selective Further Learning (SFL), is described in Section Ⅲ. In Section Ⅳ, the experimental studies are presented. Finally, we conclude the paper and discuss future work in Section Ⅴ.


    2. Related work

    Some neural network based methods, such as the Adaptive Resonance Theory map (ARTMAP) [3], [23], [29], [2] and the Evolving Fuzzy Neural Network (EFuNN) [12], have been proposed for incremental learning. Both ARTMAP and EFuNN can learn new rules by changing the architecture of the model, such as self-organizing new clusters (in ARTMAP) or creating new neurons (in EFuNN) when new data are sufficiently different from previous ones. However, estimating the difference between new data and previous ones is usually non-trivial. Moreover, both methods are very sensitive to the parameters of the algorithms.

    Research has shown good performance of ARTMAP and EFuNN in incremental learning, but their ability to learn incrementally under class imbalance has not been well investigated. In [5], where Learn++.UDNC was proposed for class imbalanced incremental learning, fuzzy ARTMAP [2] was shown to perform poorly when class imbalance occurs. Learn++.UDNC is an ensemble-based method, one of the Learn++ series [21], [20] based on AdaBoost [7]. Besides Learn++.UDNC, many other versions of Learn++ have been proposed, such as Learn++.MT [16], Learn++.MT2 [16], Learn++.NC [18] and Learn++.SMOTE [6]. Among these, Learn++.MT and Learn++.NC were proposed to handle the "out-voting" problem when learning new classes, and Learn++.MT2 was proposed to handle the imbalance of examples between data subsets; these versions did not consider class imbalance. Class imbalance in incremental learning was addressed only in Learn++.UDNC and Learn++.SMOTE. Learn++.UDNC assumes that no real concept drift happens, while Learn++.SMOTE investigates real concept drift in class imbalanced data. Therefore, the former matches the issue addressed in this paper but the latter does not.

    Besides Learn++, another type of ensemble-based method, based on Negative Correlation Learning (NCL) [14], has also been proposed for incremental learning [26], [15]. NCL is a method for constructing neural network ensembles. It improves the generalization performance of the ensemble by simultaneously decreasing the error of every neural network and increasing the diversity between the networks. In [15], two NCL-based methods, FSNCL and GNCL, were proposed. In FSNCL, the size of the ensemble is fixed and all of the neural networks are trained when new data subsets become available. In GNCL, the ensemble grows as the data sets are incrementally learned, and only the newly added neural networks are trained when new data subsets become available. In our previous work [26], SNCL was proposed: new neural networks are added and trained when new data subsets become available, and a pruning method is then employed to keep the size of the ensemble fixed. Compared with the Learn++ methods and GNCL, FSNCL and SNCL keep the ensemble size fixed as more and more data sets arrive, but their ability to preserve previously learned information is poorer.

    There are also other methods with the ability to learn incrementally. The Self-Organizing Neural Grove (SONG) [10] is an ensemble-based method with Self-Generating Neural Networks (SGNNs) [27] as the individual learners. Incremental Backpropagation Learning Networks (IBPLN) [8] employ neural networks for incremental learning by bounding the weights of the network and adding new nodes. However, these methods did not consider class imbalance in incremental learning.


    3. Our method


    3.1. Framework

    In this paper, class imbalance is considered in incremental learning. In existing work, Learn++.UDNC [5] was proposed to address this issue and has been shown to be more effective than other incremental learning methods that do not consider class imbalance. However, as a size-grown method, the ensemble in Learn++.UDNC keeps increasing as new data sets become available, and its size may become too large. Our method is also ensemble based, but at the same time we aim to keep the size of the ensemble at an acceptable level.

    In our previous work, SNCL [26], selective ensemble learning was used to keep the size of the ensemble fixed. When a new data subset arrives, it is used to train a copy of the previous ensemble. The two ensembles are then combined and half of the individuals are pruned to keep the ensemble size fixed. However, in this model, loss of previous information can easily occur because the pruning is based on the latest data subset, which biases the ensemble towards it. Furthermore, if the rules of the latest data subset differ greatly from those of the previous subsets, i.e., high sampling shift occurs, all of the individuals of the previous ensemble might be pruned. In addition, since SNCL was designed without considering class imbalance, it might not handle class imbalanced incremental problems well.

    To overcome the above drawbacks, we propose a new ensemble-based approach for incremental learning, namely Selective Further Learning (SFL). In SFL, a hybrid ensemble with two kinds of base classifiers is used. First, a group of Multi-Layer Perceptrons (MLPs) is used. When a new data subset becomes available, half of the MLPs in the current ensemble are selected and trained on it. After training, the selected MLPs are put back into the ensemble. No pruning is executed, so the risk of forgetting previous information is reduced. At the same time, as an additive model, Naive Bayes (NB) is used as a component of the ensemble to incrementally learn from new data subsets. The strong incremental learning ability of NB helps the ensemble preserve previous information if high sampling shift occurs.

    In addition, a group of weights (called impact weights) is maintained for every individual (both the MLPs and NB). The weights and the outputs of the $i$th individual are denoted as $\{w_{ik}\mid k=1,2,\ldots,C\}$ and $\{o_{ik}\mid k=1,2,\ldots,C\}$, respectively, where $C$ is the number of classes. The impact weight $w_{ik}$ indicates the 'confidence' of the output produced by the $i$th individual on class $k$. At the testing stage, for a given example, the output of the ensemble is calculated as the weighted average over all individuals:

    $$y_k=\frac{\sum_{i=1}^{M} w_{ik}\,o_{ik}}{\sum_{i=1}^{M} w_{ik}},\quad k=1,2,\ldots,C,$$ (1)

    where $y_k$ is the $k$th output of the ensemble, indicating the probability that the example belongs to class $k$, and $M$ is the number of individuals in the ensemble. Equation (1) is used only at the testing stage; at the training stage, the output of the ensemble is the arithmetic average of the individuals.
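    As a concrete illustration, the combination rule of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch: the array layout and the guard against classes with all-zero weight are our own assumptions, not part of the paper.

```python
import numpy as np

def combine(outputs, weights):
    """Weighted-average combination of Eq. (1).

    outputs: (M, C) array, outputs[i, k] = o_ik of individual i on class k.
    weights: (M, C) array, weights[i, k] = impact weight w_ik.
    Returns the C ensemble outputs y_k.
    """
    num = (weights * outputs).sum(axis=0)   # sum_i w_ik * o_ik
    den = weights.sum(axis=0)               # sum_i w_ik
    # Guard (our addition): a class no individual has confidence in gets 0.
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
```

    The predicted class is then simply the `argmax` of the returned vector.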

    $w_{ik}$ is initialized to 0 and updated as every new data subset is learned. When updating $w_{ik}$, two issues should be considered.

    On one hand, the degree to which the $i$th individual has learned the $k$th class is considered, i.e., the recall and the precision of the $i$th individual on class $k$. $w_{ik}$ should be high when both recall and precision are high. To this end, the definition of the F-measure for multi-class problems [24] is introduced:

    $$F_{ik}=\frac{2R_{ik}P_{ik}}{R_{ik}+P_{ik}}$$ (2)

    where $F_{ik}$, $R_{ik}$ and $P_{ik}$ are the F-measure, recall and precision of the $i$th individual on class $k$, respectively. According to [24], $R_{ik}$ and $P_{ik}$ are defined as

    $$R_{ik}=\frac{N^{(i)}_{kk}}{\sum_{m=1}^{C} N^{(i)}_{km}}$$ (3)

    and

    $$P_{ik}=\frac{N^{(i)}_{kk}}{\sum_{m=1}^{C} N^{(i)}_{mk}}$$ (4)

    where $N^{(i)}_{km}$ is the number of examples of class $k$ that are classified as class $m$ by the $i$th individual.
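    Equations (2)-(4) amount to reading recall and precision off a confusion-count matrix. A minimal sketch, where the matrix layout and the small-epsilon guards against empty classes are our assumptions:

```python
import numpy as np

def per_class_f_measure(conf):
    """Recall (3), precision (4) and F-measure (2) for every class.

    conf[k, m] is N_km: the number of class-k examples classified as
    class m. Row sums give the recall denominators, column sums the
    precision denominators.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                                        # N_kk
    recall = tp / np.maximum(conf.sum(axis=1), 1e-12)         # Eq. (3)
    precision = tp / np.maximum(conf.sum(axis=0), 1e-12)      # Eq. (4)
    f = 2 * recall * precision / np.maximum(recall + precision, 1e-12)
    return recall, precision, f
```

    Because the inputs are raw counts, the confusion matrices of successive data subsets can simply be summed before calling the function, which matches the accumulation of counts used by SFL; multiplying the resulting F-measures by the coefficient of Eq. (5) gives the impact weights of Eq. (6).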

    On the other hand, since MLPs can easily be biased towards the latest data subset, the outputs of the MLPs trained on the new subset should be treated with suspicion for any previous classes that do not appear in it. Therefore, a coefficient $\mu_i$ is defined for every individual $i$ to discount the impact weights:

    $$\mu_i=\frac{n_t}{n_c},$$ (5)

    where $n_t$ is the number of classes contained in the new data subset and $n_c$ is the number of classes seen in all data subsets so far.

    Considering both of the above issues, $w_{ik}$ is updated as:

    $$w_{ik}=F_{ik}\,\mu_i.$$ (6)

    For the NB model, $\mu_i$ always equals 1, since NB is not biased towards the latest data subset. For an MLP, $\mu_i$ is updated whenever the MLP is selected for training. For the MLPs that were not selected and for the NB model, the counts $N_{km}$ from the new data subset can be accumulated onto the previous counts to update $R_{ik}$ and $P_{ik}$, and then $w_{ik}$. In this way, $w_{ik}$ is not updated according to the current data subset alone, which helps preserve previous information.

    The pseudo-code of the approach is presented in Fig. 1. In the pseudo-code, Select is the process that selects MLPs from the ensemble to be trained on the new data subset, and MLPs-Training and NB-Training are the processes that train the MLPs and NB in the ensemble, respectively. The details of these processes are described in the following subsections.

    Figure 1. The Pseudo-Code For SFL.

    3.2. Some details inside SFL


    3.2.1. Selecting Process

    The selecting process is based on the current data subset $S_t$. Individuals are added to $Ens_{sel}$ one by one in a greedy manner. In each round, every MLP in $Ens_{res}$ is temporarily added to $Ens_{sel}$ to estimate the performance (the arithmetic mean of the F-measures of all classes) on the current data subset. The MLPs are tested one by one, and the MLP that makes $Ens_{sel}$ perform the worst is the one finally added to $Ens_{sel}$.

    If the current data subset lacks some classes that appeared in previous data subsets, the selection process must ensure that not all of the MLPs trained on those lost classes are moved to $Ens_{sel}$. Therefore, whenever an MLP is added to $Ens_{sel}$, the following constraint should remain satisfied by the MLPs left in $Ens_{res}$:

    $$\prod_{k\in L}\Big(\sum_{i} w_{ik}\Big)\neq 0,$$ (7)

    where $L=\{k \mid \text{class } k \text{ is not contained in } S_t\}$. If there is no MLP that can be added to $Ens_{sel}$, a newly initialized MLP is generated and added to $Ens_{sel}$.

    In this way, the MLPs that are not yet well trained are selected for further training, while the MLPs reserved in $Ens_{res}$ preserve the previously learned information.
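    The constraint check can be sketched as follows. This is a hypothetical helper: we read the sum in (7) as running over the MLPs that would remain in $Ens_{res}$, which is the reading that guarantees the lost classes stay covered.

```python
import numpy as np

def can_move(weights, res_indices, candidate, lost_classes):
    """Return True if moving `candidate` from Ens_res to Ens_sel keeps
    Eq. (7) satisfied: for every lost class, the MLPs remaining in
    Ens_res must keep a non-zero total impact weight.

    weights: (M, C) impact-weight matrix of all MLPs.
    res_indices: indices of the MLPs currently in Ens_res.
    candidate: index of the MLP we would like to move.
    lost_classes: classes absent from the current data subset.
    """
    remaining = [i for i in res_indices if i != candidate]
    if not lost_classes:
        return True      # nothing to protect
    if not remaining:
        return False     # nobody left to cover the lost classes
    totals = np.asarray(weights)[remaining][:, list(lost_classes)].sum(axis=0)
    return bool(np.all(totals > 0))
```

    When `can_move` is false for every candidate, SFL instead generates a fresh MLP for $Ens_{sel}$, as described above.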


    3.2.2. Training the model of Naive Bayes

    According to Bayes decision theory, the probability that a test example $x=\{x_i\mid i=1,2,\ldots,d\}$ belongs to class $k$ is

    $$P(k|x)=\frac{P(x|k)P(k)}{P(x)}=\frac{P(x|k)P(k)}{\sum_{k'=1}^{C}P(x|k')P(k')}$$ (8)

    where $P(k|x)$ is the posterior probability that example $x$ belongs to class $k$, $C$ is the number of classes and $P(k)$ is the prior probability of class $k$. In the class imbalance situation, we assume that $P(k)$ is equal for all classes. Besides, all features of the examples are assumed to be independent of each other. Therefore, (8) becomes

    $$P(k|x)=\frac{\prod_{i=1}^{d}P(x_i|k)}{\sum_{k'=1}^{C}\prod_{i=1}^{d}P(x_i|k')}.$$ (9)

    In incremental learning mode, $P(x_i|k)$ is updated as every new data subset arrives. $P(x_i|k)$ can be estimated as $n(x_i,k)/n(k)$, where $n(x_i,k)$ is the number of examples that belong to class $k$ and whose $i$th feature takes the value $x_i$, and $n(k)$ is the number of examples belonging to class $k$. Both $n(x_i,k)$ and $n(k)$ can be counted on each data subset and then accumulated to estimate $P(x_i|k)$. In this way, NB can learn from new data subsets without any loss of previous information.

    The estimation of $P(k|x)$ in (9) requires the feature values to be discrete. For features with continuous values, equal-width partitioning is used to discretize the feature range for calculating $P(x_i|k)$.
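    The incremental estimation above can be sketched as a small count-based classifier. This is a sketch under assumptions: discrete feature values, equal priors as stated above, and Laplace smoothing, which is our addition to avoid zero probabilities.

```python
from collections import defaultdict
import numpy as np

class IncrementalNB:
    """Naive Bayes trained one data subset at a time (Eq. (9)).

    Only the counts n(x_i, k) and n(k) are stored, so each new subset is
    absorbed by adding counts; no previous data need be kept.
    """

    def __init__(self, n_classes, n_features, alpha=1.0):
        self.C, self.d, self.alpha = n_classes, n_features, alpha
        self.n_xk = defaultdict(float)         # n(x_i, k), keyed (i, value, k)
        self.n_k = np.zeros(n_classes)         # n(k)
        self.seen = [set() for _ in range(n_features)]

    def partial_fit(self, X, y):
        for x, k in zip(X, y):
            self.n_k[k] += 1
            for i, v in enumerate(x):
                self.n_xk[(i, v, k)] += 1
                self.seen[i].add(v)

    def predict_proba(self, x):
        logp = np.zeros(self.C)
        for k in range(self.C):
            for i, v in enumerate(x):
                num = self.n_xk[(i, v, k)] + self.alpha
                den = self.n_k[k] + self.alpha * max(len(self.seen[i]), 1)
                logp[k] += np.log(num / den)
        p = np.exp(logp - logp.max())          # equal priors, as in Eq. (9)
        return p / p.sum()
```

    `partial_fit` can be called once per arriving data subset, and the posterior of Eq. (9) is recovered exactly as if the subsets had been pooled.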


    3.2.3. Training the ensemble of MLPs

    We previously proposed a Dynamic Sampling (DyS) method for class imbalance problems [13], which can be used to train the ensemble of MLPs. Similar to the approach proposed in [13], the main process of DyS for an ensemble in one epoch is as follows:

    step1. Randomly fetch an example x from the training set;

    step2. Estimate the probability p that the example should be used for updating the ensemble.

    step3. Generate a uniform random real number μ between 0 and 1.

    step4. If μ<p, use x to update the ensemble with Negative Correlation Learning (NCL) [14], making every MLP negatively correlated with the other individuals (including the MLPs in $Ens_{res}$ and the NB model).

    step5. Repeat steps 1 to 4 until every example in the training set has been fetched.

    The above epoch is repeated until the stopping criterion is satisfied. The key issue in DyS is how to estimate $p$, as described below.

    For a problem with $n_c$ classes, every MLP has $n_c$ output nodes. For an example belonging to class $k$, the target output is set to $t=\{t_j \mid t_k=1,\ t_j=0 \text{ for } j\neq k\}$, and the actual output is denoted $y=\{y_j\mid j=1,2,\ldots,n_c\}$, so the node with the highest output designates the predicted class. Both the hidden and output node functions of all MLPs are the logistic function $\varphi(x)=1/(1+e^{-x})$, so that $y_j\in(0,1)$.

    As in [13], the probability that an example belonging to class $k$ is used to update the ensemble is estimated as:

    $$p=\begin{cases}1, & \text{if } \delta<0,\\ e^{-\delta\, r_k/\min_i\{r_i\}}, & \text{otherwise,}\end{cases}$$ (10)

    where $\delta=y_k-\max_{i\neq k}\{y_i\}$ is the confidence with which the current ensemble correctly classifies the example, and $r_k$ is the number of examples of class $k$. For more details of DyS, please refer to [13].
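    For concreteness, the sampling decision can be sketched as follows. The closed form used here is our reading of Eq. (10); [13] has the authoritative definition.

```python
import numpy as np

def dys_use_probability(y, k, class_counts):
    """Probability p of Eq. (10) that an example of class k is used.

    y: current ensemble outputs for the example.
    class_counts: class_counts[k] = r_k, the number of examples of class k.
    A misclassified example (delta < 0) is always used; otherwise the
    chance decays with the confidence delta and with how over-represented
    class k is relative to the smallest class.
    """
    delta = y[k] - max(v for i, v in enumerate(y) if i != k)
    if delta < 0:
        return 1.0
    return float(np.exp(-delta * class_counts[k] / min(class_counts)))
```

    The returned value is then compared against the uniform random number μ of step 3 to decide whether the example updates the ensemble.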

    By employing DyS, the MLPs in the ensemble are able to accommodate class imbalanced situations.


    3.3. The reason for the success of SFL

    In general, several ingredients make SFL successful: the selective training of the MLPs, the use of NB, the impact weights for combining the individuals of the ensemble, and the consideration of class imbalance in the training process.

    To analyze the reasons for SFL's success, two special cases in incremental learning are considered: new classes appearing in the new data subsets, and previous classes missing from the new data subsets. The ensemble is divided into three parts: $MLP_a$, $MLP_b$ and NB, where $MLP_a$ denotes the MLPs selected to learn the new data subset and $MLP_b$ denotes the remaining MLPs.

    After learning a new data subset that contains new classes, $MLP_a$ and NB have learned the new classes while $MLP_b$ has not. In SFL, the impact weights of $MLP_b$ are updated by accumulating the performance of $MLP_b$ on the previous data subsets and the current one. Examples of the new classes are misclassified as other classes by $MLP_b$, so its recall on the new classes is zero and its precision on the predicted classes is degraded; its impact weights on the new classes therefore vanish. As a result, the wrong predictions of $MLP_b$ on test examples of the new classes have little impact on the prediction of the whole ensemble, and $MLP_a$ and NB, which have learned the new classes, play the leading role when predicting examples of the new classes.

    After learning a new data subset in which some previous classes are missing, $MLP_a$ will be biased towards the classes contained in the current data subset. However, the impact weights of $MLP_a$ are discounted by the coefficient of (5). Therefore, NB and the MLPs that learned the missing classes earlier play the leading role in the ensemble when making predictions.

    Besides, as discussed before, NB is able to learn incrementally without forgetting previous information, and its use helps protect the ensemble from catastrophic forgetting. Furthermore, class imbalance is considered in the training of both NB and the MLPs. Therefore, SFL is able to deal with class imbalance in the new data subsets.


    4. Experimental study

    To assess the performance of SFL, experiments were conducted on synthetic and real-world data sets. First, three types of synthetic data sets were generated to simulate the incremental learning process. Then, 5 real-world data sets with imbalanced class distributions from the UCI repository [1] were used to simulate incremental learning by randomly dividing the data sets. Finally, another 5 real-world data sets from the UCI repository, including 3 class imbalanced data sets and 2 class balanced data sets, were used to simulate the incremental learning process by dividing the data sets; here the division introduced new classes and the loss of previous classes in the new data subsets. The purpose of this last part is to assess SFL's ability to learn from new classes and to preserve previous information when some classes are lost in the new data subsets. Learn++.UDNC [5], a recently proposed approach that also addresses class imbalanced incremental learning, was used for comparison. Besides, in order to find out the contributions of the MLPs and NB to SFL, an ensemble of only MLPs (referred to as SFL.MLP) and the NB model alone are also compared with SFL. The recall of every class and the arithmetic mean over the recalls of all classes are used as the metrics.

    Figure 2. Table Ⅰ: the class distributions of every data subset for the three types of synthetic data sets.

    4.1. Experiments on synthetic data sets

    The synthetic data were generated as follows. Data for four classes were drawn from four 2-dimensional normal distributions with means $\mu_1=(0,0)$, $\mu_2=(0,1)$, $\mu_3=(1,1)$ and $\mu_4=(1,0)$; the two features are independent, with $\sigma_1=\sigma_2=\sigma_3=\sigma_4=0.2$. Three types of synthetic data sets were generated, and TABLE Ⅰ presents the class distributions of every data subset for each type. In Type A, there are three majority classes (classes 1 to 3) and one minority class (class 4). Class 4 first appears as a new class in $S_2$, and classes 1 to 3 each appear as an additional minority class in training subsets $S_3$ to $S_5$, respectively. This experiment examines the performance of SFL on problems with multiple majority classes (which sometimes appear as minority classes) and a single minority class (which also appears as a new class). In Type B, there are one majority class and three minority classes. Class 2 is present from the beginning but is lost in the last two training subsets. Class 3 appears as a new class in $S_2$ and is lost in the last training subset. Class 4 appears as a new class in $S_3$. This experiment examines the performance of SFL on problems with a single majority class and multiple minority classes, some of which appear as new classes or are lost in some data subsets. In Type C, the class distribution of the whole training set (i.e., the union of all training subsets) is balanced, but each training subset is class imbalanced and contains only two classes. This experiment examines the performance of SFL on problems whose class distributions are balanced overall but imbalanced within data subsets. In all three types, the distributions differ considerably between data subsets.
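    The generator behind these experiments is easy to reproduce. A sketch: we take the stated 0.2 as the standard deviation of each feature, and the per-class counts of Table Ⅰ are supplied by the caller.

```python
import numpy as np

def make_synthetic(n_per_class, seed=None):
    """Draw one synthetic subset: four 2-D Gaussian classes with means at
    the corners of the unit square and independent features.

    n_per_class: sample counts for classes 1..4, so any of the imbalanced
    subset distributions of Table I can be imposed.
    """
    rng = np.random.default_rng(seed)
    means = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
    X, y = [], []
    for k, n in enumerate(n_per_class):
        X.append(rng.normal(means[k], 0.2, size=(n, 2)))  # std 0.2 per feature
        y.append(np.full(n, k, dtype=int))
    return np.vstack(X), np.concatenate(y)
```

    Calling the function once per training subset, with a different count vector each time, reproduces the incremental streams of Types A to C.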

    Figure 3. Table Ⅱ.
    Figure 4. Table Ⅲ.
    Figure 5. Table Ⅳ.
    Figure 6. Table Ⅴ.
    Figure 7. Table Ⅵ.

    Ten MLPs, each with 20 hidden nodes, were used in SFL and SFL.MLP. The training stop error was 0.05 and the coefficient of the penalty term of NCL (referred to as λ) was 0.5. The data sets were generated 30 times independently, and the means and standard deviations over the 30 runs on the three types of data sets are presented in TABLE Ⅱ, TABLE Ⅲ and TABLE Ⅳ, respectively. The Wilcoxon signed-rank test with significance level α=0.05 was employed for the comparison between SFL and the other methods. In the results of the other methods, underlined (or bold) values denote that SFL performed significantly better (or worse) than them on those values, and values in normal type denote no significant difference. The results on the Type A data set are presented in TABLE Ⅱ. Compared with Learn++.UDNC, SFL achieves better overall recall (i.e., the average of the recalls of all classes). The recalls of SFL on classes 1 to 3, the majority classes, are not as good as those of Learn++.UDNC. However, Learn++.UDNC is biased too strongly towards the majority classes and performs very poorly on the only minority class, while SFL performs in a more balanced way across all classes. Therefore, SFL outperforms Learn++.UDNC on this data set. Compared with SFL.MLP and NB, there is little statistical difference in average recall, especially after training with $S_3$, $S_4$ and $S_5$. After training with $S_2$, where class 4 appears as a new class, SFL learns class 4 better than SFL.MLP and as well as NB. At the same time, SFL does not lose as much performance on class 1 as NB does. Although SFL loses more performance on classes 1 and 3 than SFL.MLP, it performs better than SFL.MLP on class 2. Therefore, after training with $S_2$, SFL achieves a better average recall than both SFL.MLP and NB. This observation indicates that SFL is capable of combining the advantages of MLPs and NB into a better model.

    The results on the Type B data set are presented in TABLE Ⅲ. Compared to Learn++.UDNC, similar observations can be made, and we can again conclude that SFL outperforms Learn++.UDNC on this data set. Compared to SFL.MLP and NB, some values of SFL lie between the values of SFL.MLP and NB (always closer to the larger ones), while some values of SFL are significantly larger than both. Similar observations can be made from the results on the Type C data set in TABLE Ⅳ. All these results show that SFL outperforms Learn++.UDNC and is capable of combining the advantages of both MLPs and NB to make a better model.


    4.2. Experiments on real-world data sets

    The experiments on real-world data sets include three parts. First, 5 class imbalanced data sets were divided randomly to simulate the incremental learning process. Second, 3 class imbalanced data sets were divided while taking into account the appearance of new classes and the loss of classes in the new data subsets. Finally, 2 class balanced data sets were divided into class imbalanced subsets to simulate the incremental learning process; the appearance of new classes and the loss of classes in the new data subsets were also considered.

    The class distributions of the 5 class imbalanced data sets that were randomly divided are presented in TABLE Ⅴ. Each of these data sets was first divided, in a stratified manner, into a training set (80%) and a testing set (20%), and the training set was then randomly divided into 5 training subsets. The other real-world data sets, including 3 class imbalanced data sets and 2 class balanced data sets, were divided according to predefined data distributions. The data distributions of all training subsets and testing sets are presented in TABLE Ⅵ. It can be observed from TABLE Ⅵ that, for all the data sets, the data distributions of the different training subsets are quite different, and new classes appear or previous classes are lost in some training subsets.
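    The stratified 80/20 split followed by a random 5-way partition of the training set can be sketched as follows. This is an illustrative numpy sketch, not the authors' original implementation; the toy label vector and the rounding rule for the per-class test size are assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def stratified_split(y, test_frac=0.2):
        """Per-class 80/20 split; returns (train_idx, test_idx) arrays."""
        train, test = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == c))
            n_test = max(1, int(round(test_frac * len(idx))))
            test.extend(idx[:n_test])
            train.extend(idx[n_test:])
        return np.array(train), np.array(test)

    def random_subsets(train_idx, n_subsets=5):
        """Randomly partition the training indices into n_subsets chunks."""
        return np.array_split(rng.permutation(train_idx), n_subsets)

    y = np.array([0] * 80 + [1] * 20)  # a toy imbalanced label vector
    train_idx, test_idx = stratified_split(y)
    subsets = random_subsets(train_idx)
    ```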

    Figure 8. TABLE Ⅶ.

    For all the data sets, 10 MLPs, each with 20 hidden nodes, were used in SFL, and the coefficient λ of NCL was 0.5. An independent execution was carried out for every data set to set the stop criterion for training the MLPs so as to ensure the convergence of the training process. All the data sets were divided 30 times independently, and for each division all the compared methods were executed once. The means and standard deviations of the overall recalls after the arrival of every data subset, over the 30 executions, are presented for all the real-world data sets in TABLE Ⅶ. As before, the Wilcoxon signed-rank test with significance level α=0.05 was employed for the comparisons between SFL and the other methods, with underlined (bold) values denoting that SFL performed significantly better (worse) and values in normal type denoting no significant difference.

    Figure 9. TABLE Ⅷ.
    Figure 10. TABLE Ⅸ.

    It can be observed from TABLE Ⅶ that SFL outperforms Learn++.UDNC on most of the data sets, including Soybean, Splice, Thyroid-allrep, Car, Nursery, Optdigits and Vehicle. On the other data sets, SFL does not perform significantly worse than Learn++.UDNC either. When compared with SFL.MLP and NB, the performance of SFL usually leans toward the better of the two, and sometimes SFL outperforms both of them, as on Soybean, Nursery, Optdigits and Vehicle. These observations further support that SFL is capable of combining the advantages of both MLPs and NB to make a better model.

    Figure 11. TABLE Ⅹ.

    On Car, Nursery, Page-blocks, Optdigits and Vehicle, the data sets were divided according to the distributions presented in TABLE Ⅵ, where new classes usually appear or previous classes are lost in the new data subsets. It is worth examining the detailed per-class results on these data sets. Therefore, the detailed results on two of them, i.e., Nursery (class imbalanced) and Optdigits (class balanced), are further presented.

    The means and standard deviations over the 30 executions on Nursery are presented in TABLE Ⅷ, with the same Wilcoxon signed-rank test (α=0.05) and the same underline/bold convention as before. It can be observed from TABLE Ⅷ that the performance of Learn++.UDNC on class 3 is much worse than that of SFL. Class 3 is a minority class: it comes up in S2 as a new class and is lost in S4. These observations indicate that SFL can handle this kind of problem. To see the respective effects of the MLPs and NB in SFL, we pay more attention to the comparison with SFL.MLP and NB. It can be observed that the MLPs learn better than NB when no class is lost. However, when class 3 is lost in S4, the MLPs lose much more recall on class 3 than NB does. As the combination of MLPs and NB, SFL does not lose too much recall on class 3 and, at the same time, learns better than NB on the other classes, which leads to better overall recalls. Therefore, on this data set, the MLPs help SFL to learn better and NB helps SFL to preserve the previously learned information, especially when class loss occurs.

    Figure 12. TABLE Ⅺ.
    Figure 13. TABLE Ⅻ.

    The means and standard deviations over the 30 executions on Optdigits are presented in TABLE Ⅸ, with the same Wilcoxon signed-rank test (α=0.05) and the same underline/bold convention as before. In S2, class 4 first comes up as a minority class, class 8 first comes up as a majority class, and class 1, class 6 and class 10 are lost. After learning from S2, the recall of SFL on class 4 is larger than those of Learn++.UDNC and NB, but not as large as that of SFL.MLP; the recall of SFL on class 8 is the best among all the methods. At the same time, the degradation of SFL on class 1, class 6 and class 10 is much less than that of Learn++.UDNC and SFL.MLP and only a bit more than that of NB. This observation indicates that SFL can learn new classes with little performance degradation on the lost classes. Even though SFL.MLP performs better on class 2 and class 4 when they first come up, it degrades much more on the other classes. Therefore, it is not surprising that SFL obtains the best overall recalls.

    The experimental results indicate that the performance of Learn++.UDNC usually leans toward the majority classes. Even when it performs better on some minority classes, the performance on the other classes is usually degraded too much. On the contrary, SFL usually obtains more balanced performance across the classes and better overall performance. This is because SFL and Learn++.UDNC handle class imbalance in different ways. In SFL, class imbalance is considered when training the model, and the method for training the MLPs has been shown to be effective for class imbalance problems. In Learn++.UDNC, the training process does not consider class imbalance; instead, a transfer function that takes class imbalance into account is applied to the outputs, and the effectiveness of this method has not been well validated. Even in the results presented in [5], the performance on the minority classes was much worse than that on the majority classes. Therefore, it is not surprising that SFL outperforms Learn++.UDNC on most of the data sets.


    4.3. Computational time

    The computational time of SFL and Learn++.UDNC on all the data sets is presented in TABLE Ⅹ. It can be observed from TABLE Ⅹ that SFL usually takes less computational time than Learn++.UDNC. In the experiments, the structures of the MLPs and the stop criterion were the same for SFL and Learn++.UDNC. However, more MLPs were trained for Learn++.UDNC for every new data subset. Moreover, the training process of SFL usually meets the stop criterion earlier than that of Learn++.UDNC. Therefore, SFL is usually faster than Learn++.UDNC.


    4.4. Analyses about the components of SFL

    In SFL, two kinds of base classifiers, i.e., MLPs and NB, are employed to construct the ensemble. The results have shown that SFL is capable of outperforming the model with only MLPs and the model with only NB. To find out the reason, the differences between SFL and its components (MLPs and NB), and the influence of these differences, are investigated in detail.

    After every data subset is learned, three numbers are counted on the testing data set: the number of examples that are correctly classified by only the MLPs (#1), the number correctly classified by only NB (#2), and the number of examples correctly classified by only the MLPs or only NB that are also correctly classified by SFL (#3). Then four ratios are estimated:

    $$\left\{\begin{aligned}\rho_1 &= (\#1+\#2)/\#t\\ \rho_2 &= \#1/(\#1+\#2)\\ \rho_3 &= \#2/(\#1+\#2)\\ \rho_4 &= \#3/(\#1+\#2)\end{aligned}\right.\tag{11}$$

    where #t is the number of examples in the testing data set. The ratios are estimated for all the data sets, and the average values over the 30 executions are presented in TABLE Ⅺ. ρ1 indicates the diversity (in making correct classification decisions) between the MLPs and NB. ρ4 indicates the benefit that SFL gets from the difference between the MLPs and NB. It can be observed from TABLE Ⅺ that the values of ρ4 are always closer to the larger of ρ2 and ρ3 and sometimes exceed both of them. These observations partially explain why SFL always performs toward the better of MLPs and NB and sometimes exceeds both of them.
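    The four ratios of Eq. (11) can be computed from per-example correctness masks as sketched below. This is an illustrative numpy sketch under the assumption that the three classifiers' per-example correctness is available as boolean arrays; the helper name and example masks are hypothetical.

    ```python
    import numpy as np

    def component_ratios(correct_mlp, correct_nb, correct_sfl):
        """Compute (rho1, rho2, rho3, rho4) of Eq. (11) from correctness masks."""
        correct_mlp = np.asarray(correct_mlp, bool)
        correct_nb = np.asarray(correct_nb, bool)
        correct_sfl = np.asarray(correct_sfl, bool)
        only_mlp = correct_mlp & ~correct_nb             # counted by #1
        only_nb = correct_nb & ~correct_mlp              # counted by #2
        n1, n2 = only_mlp.sum(), only_nb.sum()
        n3 = ((only_mlp | only_nb) & correct_sfl).sum()  # #3
        nt = len(correct_sfl)                            # #t
        denom = n1 + n2
        return (denom / nt, n1 / denom, n2 / denom, n3 / denom)
    ```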


    4.5. Analyses of parameters

    There are several parameters in SFL, including the number of MLPs, the number of hidden nodes in every MLP, the stop criterion for training the MLPs and the coefficient λ in NCL. In our experimental studies, the number of MLPs and the number of hidden nodes in every MLP were set based on experience. An independent execution was carried out for every data set to set the stop criterion so as to ensure the convergence of the training process. The coefficient λ in NCL controls the diversity between the individuals in the ensemble (a larger λ leads to larger diversity). In the study of NCL [14], λ was suggested to be between 0 and 1. In our experimental studies, it was set to 0.5 for all the data sets. Since diversity is a very important factor in the success of ensemble learning methods [25], it is worth examining the performance of SFL with different λ.
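    The role of λ can be illustrated with the per-individual NCL error of Liu and Yao [14], e_i = (f_i − d)² + λ p_i with p_i = (f_i − f̄) Σ_{j≠i}(f_j − f̄) = −(f_i − f̄)². The function below is a minimal numpy sketch for a single training example, not the authors' implementation; with λ = 0 it reduces to independent squared errors, while larger λ rewards individuals for deviating from the ensemble mean.

    ```python
    import numpy as np

    def ncl_errors(outputs, target, lam=0.5):
        """Per-individual NCL error on one example.

        outputs: outputs f_i of the ensemble members on the example.
        target:  desired output d.
        lam:     NCL penalty coefficient (lambda), suggested in [0, 1].
        """
        f = np.asarray(outputs, float)
        fbar = f.mean()                    # ensemble (simple-average) output
        penalty = -(f - fbar) ** 2         # p_i = -(f_i - fbar)^2
        return (f - target) ** 2 + lam * penalty
    ```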

    Extra executions of SFL with λ = 0, 0.25, 0.75 and 1 were conducted on all the data sets. The Wilcoxon signed-rank test with significance level α=0.05 was employed to compare the overall recalls after training with each data subset. The results of every setting of λ were compared with the results of the other four settings, and the win-draw-lose counts are presented in TABLE Ⅻ. It can be observed from TABLE Ⅻ that λ affects the performance on most data sets. On some data sets, such as Synthetic Type A, Synthetic Type B, Nursery, Page-blocks, Optdigits and Vehicle, we can also observe the trend that the performance becomes better as λ decreases. In SFL, λ is not the only factor that encourages diversity. On the one hand, the model built by NB may be quite different from the MLPs. On the other hand, in incremental learning, different MLPs may be trained with different data subsets, which also produces diversity, especially when the data subsets are quite different. Therefore, a large λ (such as 1) may overemphasize diversity, so that the performance is degraded.


    5. Conclusions and future work

    This paper investigates incremental learning in class imbalanced situations. An ensemble-based method, i.e., SFL, which is a hybrid of MLPs and NB, was proposed. A group of weights (with the number of classes as the length) is updated for every individual of the ensemble to indicate the 'confidence' of the individual about the classes it has learned. The weights affect the output of the ensemble through a weighted average of all individuals' outputs. The training of the MLPs and NB takes class imbalance into account so that the ensemble can adapt to class imbalanced situations.
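    The weighted-average combination described above can be sketched as follows. This is an illustrative numpy sketch under the assumption that each ensemble member produces per-class scores and holds one confidence weight per class; the function name and the toy numbers are hypothetical, not the authors' exact combination rule.

    ```python
    import numpy as np

    def combine(outputs, weights):
        """Weighted-average combination of ensemble member outputs.

        outputs: (n_members, n_classes) per-class scores of each individual.
        weights: (n_members, n_classes) per-class 'confidence' weights.
        Returns the predicted class index.
        """
        outputs = np.asarray(outputs, float)
        weights = np.asarray(weights, float)
        # Per-class weighted average across members.
        scores = (weights * outputs).sum(axis=0) / weights.sum(axis=0)
        return int(scores.argmax())
    ```

    A member with a high confidence weight on a class it has actually seen (e.g. NB after a class was lost from later subsets) thus dominates the ensemble's output on that class.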

    The experimental studies on 3 synthetic data sets and 10 real-world data sets have shown that the performance of SFL was better than that of a recently proposed approach for class imbalanced incremental learning, i.e., Learn++.UDNC [5]. The experimental results have also shown that SFL can combine the advantages of both MLPs and NB to make a better model.

    SFL has successfully combined MLPs and NB. The experimental studies have shown that incorporating additive models can bring progress in incremental learning. However, this is only a preliminary attempt. Other additive models, such as parameter estimation models, might also help to improve SFL. This is a direction of our future work.


    [1] A. Asuncion and D. Newman, Uci machine learning repository, 2007.
    [2] Carpenter G. A., Grossberg S., Markuzon N., Reynolds J. H., Rosen D. B. (1992) Fuzzy artmap: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks 3: 698-713. doi: 10.1109/72.159059
    [3] G. A. Carpenter, S. Grossberg and J. H. Reynolds, ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network, Elsevier Science Ltd., 1991. doi: 10.1109/ICNN.1991.163370

    [4] Chawla N. V., Japkowicz N., Kotcz A. (2004) Editorial: Special issue on learning from imbalanced data sets. Acm Sigkdd Explorations Newsletter 6: 1-6.
    [5] Ditzler G., Muhlbaier M. D., Polikar R. (2010) Incremental learning of new classes in unbalanced datasets: Learn++.UDNC. International Workshop on Multiple Classifier Systems, Multiple Classifier Systems 33-42. doi: 10.1007/978-3-642-12127-2_4
    [6] Ditzler G., Polikar R., Chawla N. (2010) An incremental learning algorithm for non-stationary environments and class imbalance. International Conference on Pattern Recognition 2997-3000. doi: 10.1109/ICPR.2010.734
    [7] Freund Y., Schapire R. E. (1999) A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14: 771-780.
    [8] Fu L., Hsu H.-H., Principe J. C. (1996) Incremental backpropagation learning networks. IEEE Transactions on Neural Networks 7: 757-761.
    [9] He H., Garcia E. A. (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21: 1263-1284.
    [10] Inoue H., Narihisa H. (2005) Self-organizing neural grove and its applications. IEEE International Joint Conference on Neural Networks 2: 1205-1210. doi: 10.1109/IJCNN.2005.1556025
    [11] N. Japkowicz and S. Stephen, The Class Imbalance Problem: A Systematic Study IOS Press, 2002.
    [12] Kasabov N. (2001) Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning. IEEE Transactions on Systems Man & Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man & Cybernetics Society 31: 902-918. doi: 10.1109/3477.969494
    [13] Lin M., Tang K., Yao X. (2013) Dynamic sampling approach to training neural networks for multiclass imbalance classification. IEEE Transactions on Neural Networks and Learning Systems 24: 647-660.
    [14] Liu Y., Yao X. (1999) Simultaneous training of negatively correlated neural networks in an ensemble. IEEE Transactions on Systems Man & Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man & Cybernetics Society 29: 716-725.
    [15] Minku F. L., Inoue H., Yao X. (2009) Negative correlation in incremental learning. Natural Computing 8: 289-320. doi: 10.1007/s11047-007-9063-7
    [16] Muhlbaier M., Topalis A., Polikar R. (2004) Incremental learning from unbalanced data. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, IEEE 2: 1057-1062. doi: 10.1109/IJCNN.2004.1380080
    [17] Muhlbaier M., Topalis A., Polikar R. (2004) Learn++.mt: A new approach to incremental learning. Lecture Notes in Computer Science 3077: 52-61. doi: 10.1007/978-3-540-25966-4_5
    [18] M. D. Muhlbaier, A. Topalis and R. Polikar, Learn ++. nc: combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes, IEEE Transactions on Neural Networks 20 (2009), p152.
    [19] Ozawa S., Pang S., Kasabov N. (2008) Incremental learning of chunk data for online pattern classification systems. IEEE Trans Neural Netw 19: 1061-1074. doi: 10.1109/TNN.2007.2000059
    [20] Polikar R., Byorick J., Krause S., Marino A. (2002) Learn++: A classifier independent incremental learning algorithm for supervised neural networks. International Joint Conference on Neural Networks 1742-1747. doi: 10.1109/IJCNN.2002.1007781
    [21] Polikar R., Upda L., Upda S. S., Honavar V. (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems Man & Cybernetics Part C 31: 497-508. doi: 10.1109/5326.983933
    [22] Salganicoff M. (1997) Tolerating concept and sampling shift in lazy learning using prediction error context switching. Artificial Intelligence Review 11: 133-155. doi: 10.1007/978-94-017-2053-3_5
    [23] Su M. C., Lee J., Hsieh K. L. (2006) A new artmap-based neural network for incremental learning. Neurocomputing 69: 2284-2300. doi: 10.1016/j.neucom.2005.06.020
    [24] Sun Y., Kamel M. S., Wang Y. (2006) Boosting for learning multiple classes with imbalanced class distribution. In Data Mining, 2006. ICDM'06. Sixth International Conference on, IEEE 592-602. doi: 10.1109/ICDM.2006.29
    [25] Tang E. K., Suganthan P. N., Yao X. (2006) An analysis of diversity measures. Machine Learning 65: 247-271. doi: 10.1007/s10994-006-9449-2
    [26] Tang K., Lin M., Minku F. L., Yao X. (2009) Selective negative correlation learning approach to incremental learning. Neurocomputing 72: 2796-2805. doi: 10.1016/j.neucom.2008.09.022
    [27] Wen W. X., Liu H., Jennings A. (2002) Self-generating neural networks. International Joint Conference on Neural Networks 4: 850-855.
    [28] Widmer G., Kubat M. (1993) Effective learning in dynamic environments by explicit context tracking. In Machine learning: ECML-93, Springer 667: 227-243. doi: 10.1007/3-540-56602-3_139
    [29] Williamson J. R. (1996) Gaussian artmap: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks 9: 881-897. doi: 10.1016/0893-6080(95)00115-8
  • © 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)