1. Introduction
Human activity recognition (HAR) is a classification task in machine learning [1]. Its goal is to classify some activities performed by a user within a certain period of time. People's activities can include different types, such as walking, running, going upstairs, going downstairs, sitting, etc. The human activity recognition program can be applied to the fields of medical care, fitness and so on [2,3]. The machine learning model trains on data collected by smart devices with accelerometer and gyroscope sensors to achieve the purpose of tracking the user's health [4,5].
In real life, data is scattered among individual users or various institutions. A machine learning model requires a large amount of data to train well. However, concentrating user data on a central server risks leaking private data. Moreover, recent regulations such as the GDPR are designed to protect user privacy [6]. Users or institutions therefore cannot share their data with a data center, which makes it difficult to use these valuable data to train powerful machine learning models.
Federated learning, proposed by Google, trains a shared global model collaboratively while keeping user data scattered [7,8]. A typical method of implementing federated learning is federated averaging (FedAvg) [7], which updates parameters on the server by averaging the local model parameters uploaded from each client. After multiple iterations, a shared global neural network model is generated on the server and distributed to each client. These studies focused on federated learning based on neural networks. Because other machine learning models also have attractive properties, many studies have begun to train them in federated settings [9,10]. Considering the faster training speed of tree models and the high accuracy of tree-based ensemble models, some studies have applied tree-based ensembles to federated learning, such as the federated gradient boosting decision tree [11] and the federated extreme tree [12].
In the training process of federated learning, the original data is stored locally on the client and is not exposed to the server or any other users, which alleviates the leakage of user data. However, some studies have shown that the parameters or intermediate information in the model training process may still leak user privacy [13]. To ensure the privacy and security of user data in federated learning, existing works mainly use differential privacy or homomorphic encryption to protect the intermediate parameters in the model training process [14,15]. Recently, Mo et al. [16] utilized the trusted execution environment (TEE) widely available in mobile devices to hide model updates from attackers through local TEE training on the client and TEE secure aggregation on the server.
Federated learning can effectively alleviate the problem of data islands and has been widely used in various practical tasks, especially in the field of medical and health care. A system based on blockchain technology elements and threaded federated learning was proposed in [17], in which an agent with a consortium mechanism was constructed for the classification results of many machine learning solutions. This research provides a new multi-agent model that can be implemented as a real-time medical data processing system. Sozinov et al. [18] used federated learning to train a classifier to address the challenge of insufficient data for a single user in human activity recognition tasks. However, an important problem in federated learning is that the final global model lacks personalization. Most methods generate one common model from the data of all users. Due to the heterogeneity of user data in real federated learning, a unified model may not be the best solution for all users. In the human activity recognition task, for instance, different users have different physical characteristics and daily activities. Therefore, a unified model cannot meet the needs of all users and cannot achieve personalized medicine. In this case, each user wants to obtain a personalized model instead of a globally shared model after participating in federated learning.
Some existing personalization methods are designed for the training of neural networks in federated learning, but there is no relevant research on personalization methods for tree-based federated models. The lightweight tree-based model is more suitable for training and deployment on wearable devices with limited computing resources. We are therefore mainly concerned with how to apply the federated random forest to the task of activity recognition and generate a personalized federated model for each user. Inspired by previous work, we propose a new privacy-protected federated personalized random forest model (PP-FPRF) to accurately and securely support real-world activity recognition applications. We make three main contributions:
Personalization. The federated personalized random forest is considered at two levels. First, from the data point of view, according to the data characteristics of the users in the activity recognition task, locality sensitive hashing (LSH) [19] is used to measure the data similarity between users, and each user is trained in cooperation with other users who have similar data characteristics. Second, from the model point of view, the user selects base classifiers by incremental selection, an ensemble learning technique, to obtain a personalized model.
Privacy protection. In the process of the users' cooperative training, to protect users' privacy, each user participating in the training communicates the optimal split of candidate attributes in non-leaf nodes and the class counts in leaf nodes based on the exponential mechanism and the Laplace mechanism, respectively.
Feasibility. We evaluated the proposed framework on real human activity recognition datasets and conducted extensive experiments. The experimental results show the effectiveness of our model.
The rest of this paper is organized as follows. Section 2 overviews the work related to our research. The preliminaries on locality sensitive hashing and differential privacy are introduced in Section 3. In Section 4, we describe our approach in detail. The experimental evaluations and results are discussed in Section 5. Finally, Section 6 summarizes the paper.
2. Related work
Since our work is related to tree-based federated learning and personalized federated learning, we discuss some existing methods. In addition, we also analyze the differences between our work and existing methods.
2.1. Personalized federated learning
Some studies have paid attention to the heterogeneity of data in federated learning and have proposed personalized solutions. Wang et al. [20] fine-tuned the federated model on the local data of each client to personalize the user model. After training a unified federated model, in the process of personalized learning, all convolutional and pooling layers in the network are frozen, and only the parameters of the fully connected layer are updated by stochastic gradient descent (SGD). Fedhealth aggregated model parameters through federated learning and was then applied in personalized medicine by building a personalized model for each user through transfer learning [21]. In multi-task learning [22], multiple related tasks are solved simultaneously, allowing the model to take advantage of the commonalities and differences of different tasks through cooperative learning. Smith et al. [23] developed MOCHA, a framework for federated multi-task learning, to solve the personalization problem. Yu et al. [24] extended the personalization method and proposed three different schemes to personalize the federated model: fine-tuning, multi-task learning and knowledge distillation.
But these works are aimed at the personalized training of neural networks in federated learning. The random forest model is a lightweight model that is more suitable for computing constrained wearable devices [21], so we focus on the application of federated random forest in activity recognition tasks and design the corresponding personalized method to improve the model effect.
2.2. Tree-based federated learning
The tree-based ensemble model has been applied to horizontal and vertical federated learning. In vertical federated learning, the clients hold the same samples but different feature spaces. In this direction, Liu et al. [9] proposed a federated forest framework based on classification and regression trees (CART) and bagging. This framework has a certain degree of privacy protection, and the communication burden is not high during prediction. In horizontal federated learning, data samples with the same features are distributed among multiple parties. Li et al. [11] studied a practical federated environment with loose privacy constraints, in which medical institutions jointly trained the gradient boosting decision tree (GBDT) model and used gradient weighting to improve the performance of the model. Liu et al. [12] extended extra-trees to provide a concise algorithm that keeps the computational complexity to a minimum and greatly increases the training speed to adapt to horizontal federated scenarios, while using differential privacy to protect intermediate data.
These tree-based methods are suitable for collaborative training among several medical, financial and other data institutions. They generate a unified model for all users without considering personalization. The necessary motivation for this collaboration is that federated learning should generate a better model than one generated from the local data of a user alone. However, in human activity recognition tasks the number of users is larger and the model needs to be personalized, so current tree-based methods are not effective enough. We therefore investigate tree-based personalized federated learning for activity recognition tasks.
3. Preliminaries
In this section, we give some basic concepts and review the knowledge about locality sensitive hashing and differential privacy.
3.1. Locality-sensitive hashing
LSH was originally proposed by Gionis et al. [19]. It is a fast nearest neighbor search algorithm for massive high-dimensional data. The main idea of LSH is to select a hash function so that the hash values of two neighboring points are equal with a high probability. Conversely, the hash values of two non-neighboring points are unequal with a high probability. For a domain S of the point set, an LSH family is defined as:
Definition 1. [19]. A family H of functions from S to U, H={h:S→U}, is called (r1,r2,p1,p2)-sensitive if, for any v,q∈S, where d(v,q) is the distance between the two vectors:
$$\text{if } d(v,q) \le r_1 \text{ then } \Pr_{h \in H}[h(v) = h(q)] \ge p_1,$$
$$\text{if } d(v,q) > r_2 \text{ then } \Pr_{h \in H}[h(v) = h(q)] \le p_2.$$
A characteristic of LSH is that multiple inputs can map to the same hash value. Therefore, LSH has been used to protect user privacy in applications such as keyword search [25] and recommendation systems [26]. A widely used p-stable LSH family was proposed by Datar et al. [27]. The hash functions Fa,b are expressed as
$$F_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$
where v is a d-dimensional vector representing a sample; a is a d-dimensional vector with entries selected independently from p-stable distribution [27]; b is a real number randomly selected from the range [0, r]; r is a positive real number representing the size of the window.
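The p-stable hash function of Datar et al. [27], F_{a,b}(v) = ⌊(a·v + b)/r⌋, can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this paper: it assumes the Gaussian distribution (which is 2-stable) for the entries of a, and the names make_pstable_hash and matches, the window size r = 4.0 and the toy vectors are all hypothetical.

```python
import math
import random


def make_pstable_hash(dim, r, rng):
    """Draw one hash function F_{a,b}(v) = floor((a . v + b) / r)."""
    # Entries of a are drawn from the standard Gaussian, which is 2-stable.
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # The offset b is drawn uniformly from [0, r].
    b = rng.uniform(0.0, r)

    def hash_fn(v):
        projection = sum(a_i * v_i for a_i, v_i in zip(a, v))
        return math.floor((projection + b) / r)

    return hash_fn


rng = random.Random(0)
hashes = [make_pstable_hash(dim=3, r=4.0, rng=rng) for _ in range(8)]


def matches(x, y):
    """Number of hash functions on which x and y collide."""
    return sum(h(x) == h(y) for h in hashes)


v = [1.0, 2.0, 3.0]
near = [1.01, 2.0, 2.99]      # a small perturbation of v
far = [40.0, -25.0, 10.0]     # a distant point
print(matches(v, near), matches(v, far))
```

Nearby vectors tend to collide on more of the hash functions than distant ones, although for any single draw of a and b this is only a probabilistic tendency.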
3.2. Differential privacy
The differential privacy model proposed by Dwork et al. [14] perturbs the calculation results to ensure that deleting or adding a single item in the database does not noticeably affect the output of the database access mechanism. This makes it difficult for adversaries to judge whether a person is in the database, because the outputs with and without that person are nearly indistinguishable. In this way, personal sensitive information is protected.
Definition 2. (ϵ-Differential privacy [14]). A randomized mechanism M provides ϵ-differential privacy if, for any two databases D1 and D2 differing in one element and any set of outputs O⊆R, where R is the output range of M:
$$\Pr[M(D_1) \in O] \le e^{\epsilon} \cdot \Pr[M(D_2) \in O]$$
The privacy budget ϵ controls the privacy protection level of differential privacy, and a smaller privacy budget represents stronger privacy protection.
Definition 3. (Sensitivity [28]). Given a function f:D→R^d over an arbitrary domain D, the global sensitivity of f is defined as
$$\Delta f = \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_1$$
where D1 and D2 differ in one record.
To obtain ϵ-differential privacy, the noise is calibrated according to the sensitivity of the function. The sensitivity of a real-valued function represents the maximum possible change in its value due to the addition or deletion of a single record.
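Theorem 1. (Laplace Mechanism [28]). Given a function f:D→R^d over an arbitrary domain D, the computation
$$M(D) = f(D) + \left(\mathrm{Lap}_1(\Delta f/\epsilon), \ldots, \mathrm{Lap}_d(\Delta f/\epsilon)\right),$$
where each Lap_i(Δf/ϵ) is an independent draw from the Laplace distribution with scale Δf/ϵ, provides ϵ-differential privacy.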
For example, for the count function f over a set S, f(S)=|S|, the sensitivity is 1. Therefore, the mechanism that returns the noisy count M(S)=|S|+Lap(1/ϵ) provides ϵ-differential privacy.
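The noisy count can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's code: the helper names laplace_noise and noisy_count are hypothetical, and the Laplace variable is sampled as the difference of two independent exponential variables, which is one standard way to draw it with the Python standard library.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Lap(scale) as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(items, epsilon, rng):
    """Return |S| + Lap(1 / epsilon); the count function has sensitivity 1."""
    sensitivity = 1.0
    return len(items) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
records = ["a", "b", "c", "d", "e"]   # toy set S with |S| = 5
for eps in (0.1, 1.0, 10.0):
    # A smaller epsilon means a larger noise scale and stronger protection.
    print(eps, round(noisy_count(records, eps, rng), 3))
```

The noise has mean zero, so averaging many independent noisy counts would approach the true count; this is why each release has to be charged against the privacy budget.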
Theorem 2. (Exponential Mechanism [29]). Suppose the input of a randomized algorithm M is the dataset D, the output is r∈Range(M), q(D,r) is the quality function, and Δq is the sensitivity of the quality function. If the algorithm selects and outputs r from the range with a probability proportional to exp(ϵq(D,r)/(2Δq)), then algorithm M provides ϵ-differential privacy protection.
The following is an example of the exponential mechanism [30]. Suppose a competition is to be held, and the available items are from the collection {swimming,running,basketball}. Participants vote to determine an item, and the entire decision process must satisfy ϵ-differential privacy. The number of votes is taken as the quality function, so Δq=1. Then, according to the exponential mechanism, the output probability of each item can be calculated under a given privacy budget ϵ.
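The output probabilities in the voting example can be computed directly from Theorem 2. The sketch below is illustrative only: the vote counts are hypothetical (they are not taken from [30]), the function names are ours, and the maximum score is subtracted before exponentiation purely for numerical stability, which leaves the normalized probabilities unchanged.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity=1.0):
    """Probability of each item, proportional to exp(eps * q / (2 * dq))."""
    top = max(scores.values())
    weights = {
        item: math.exp(epsilon * (score - top) / (2.0 * sensitivity))
        for item, score in scores.items()
    }
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}


def exponential_mechanism_sample(scores, epsilon, rng, sensitivity=1.0):
    """Draw one item according to the exponential mechanism."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    items = list(probs)
    return rng.choices(items, weights=[probs[i] for i in items], k=1)[0]


# Hypothetical vote counts; the quality function is the number of votes.
votes = {"swimming": 30, "running": 25, "basketball": 8}
for eps in (0.0, 0.1, 1.0):
    probs = exponential_mechanism_probs(votes, eps)
    print(eps, {item: round(p, 3) for item, p in probs.items()})

print(exponential_mechanism_sample(votes, 1.0, random.Random(0)))
```

With ϵ=0 every item is equally likely, so the output reveals nothing about the votes; as ϵ grows, the item with the most votes is selected with increasing probability.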
Theorem 3. (Sequential Composition [31]). Given algorithms M1,M2,…,Mn whose privacy budgets are ϵ1,ϵ2,…,ϵn, respectively, then for the same dataset D, the combined algorithm M(M1(D),M2(D),…,Mn(D)) provides (∑_{i=1}^{n} ϵi)-differential privacy protection.
This property shows that for a differential privacy sequential composition algorithm, its privacy protection level is the sum of all privacy budgets.
Theorem 4. (Parallel Composition [31]). Given algorithms M1,M2,…,Mn whose privacy budgets are ϵ1,ϵ2,…,ϵn, respectively, then for disjoint datasets D1,D2,…,Dn, the combined algorithm M(M1(D1),M2(D2),…,Mn(Dn)) provides (max ϵi)-differential privacy protection.
In a differential privacy protection algorithm sequence, if all the datasets processed by these algorithms do not intersect each other, then the privacy protection level provided by the algorithm sequence depends on the algorithm with the worst protection level, that is, the algorithm with the largest privacy budget.
4. Methods
In this section, we introduce the federated personalized random forest framework, which allows random forest models to be trained for every user in a horizontal federated setting. A user only trains with some similar users instead of all users. Our motivation is that in the task of activity recognition, people have different physical characteristics and activity patterns, and their data are very different. The models trained by similar users are more suitable for their own characteristics.
Table 1 summarizes the important symbols that are used frequently in this paper. In the activity recognition task, each user Ui has its own local dataset Di and can train a local model Mil based on Di. The user obtains a unified model Mf through traditional federated learning. But the unified model Mf does not consider the differences among users. Therefore, it does not achieve good performance for some users, and is even worse than some users' local models. Our goal is to let each user Ui get a personalized random forest model Mip through personalized federated learning.
Figure 1 shows the structure of our approach. The user Ui and its similar users are taken as an example to describe the personalized training phase. Step 1 is the stage of finding similar users: each user first uses the global hash tables to find the users who are similar to them. In step 2, the user trains a random decision tree with similar users. Then, in step 3, the user individually makes a personalized selection of the newly generated decision tree. An overview of the PP-FPRF algorithm is shown in Algorithm 1. The inputs are the LSH functions {Fk}, k=1,2,...,L, the data of all users {Di}, i=1,2,...,I, the number of trees T and the differential privacy budget B. Each user participates in training and obtains his own personalized model Mip. In line 1, we call the function PREPROCESS [11] for each user Ui to obtain his similar users Si. Each user Ui can initiate a training session and coordinate his similar users Si to train a tree. In this training, Ui is regarded as the master node, and his similar users participating in the training are regarded as cooperative users. A user in federated learning can initiate a training session as a master or participate in a training session as a collaborator. In line 3, the master obtains a new tree by Algorithm 2 (TREEBUILD_M); at the same time, each collaborator gets a new tree by Algorithm 3 (TREEBUILD_C). In line 9, each user participating in the training applies INCRE_SELECT to the new decision tree, based on his local data, to determine whether to add the new tree to his own personalized model Mip.
4.1. Users similarity calculation
The PREPROCESS method utilizes the widely used p-stable LSH function to obtain the similarity of any two samples in different users [27], without exposing the original data to other users [11]. According to the characteristics of LSH, if two samples are similar, they are more likely to be hashed to the same value. Therefore, by using multiple LSH functions, the bigger the number of identical hash values of two samples, the greater the likelihood that they are similar.
The function PREPROCESS works as follows. Given L randomly generated p-stable hash functions, the users first calculate the hash values corresponding to their samples, so that each sample is mapped to L hash values by the L hash functions. The AllReduce operation is used to build L global hash tables; the inputs to AllReduce are the sample IDs and hash values of all users, and the reduction operation combines the sample IDs that share the same hash value. The aggregated hash tables are propagated to each user by adopting the previously proposed bandwidth-optimal and contention-free approach [32]. After each user gets the global hash tables, he calculates his similarity with other users. For example, user Ui counts the number of identical hash values between each sample of Ui and each sample of Uj. If the number of identical hash values of two samples is larger than a specific threshold, the two samples are considered similar.
Li et al. [11] used LSH to find similar samples in order to weight the gradient. The basic idea is that an instance is important if it is similar to many other instances. Unlike them, we use LSH to find similar samples and then go on to find each user's similar users. User Ui counts how many samples in Uj are similar to his own samples, and calculates the proportion of these similar samples in Uj. Because of the randomness of the LSH functions, it is not easy to define similar users with a specific threshold. Therefore, through comparison, each user takes the top-k users with the highest ratios of similar samples as his own similar users.
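The selection of similar users can be sketched as follows. This is a simplified illustration under several assumptions: it compares per-sample hash signatures directly instead of going through the AllReduce-built global hash tables, it interprets the similar-sample ratio as the fraction of samples of Uj that are similar to at least one sample of Ui, and the function names and toy signatures are hypothetical.

```python
def count_matches(sig_a, sig_b):
    """Number of positions (hash functions) on which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b))


def similar_ratio(sigs_i, sigs_j, threshold):
    """Fraction of U_j's samples that are similar to some sample of U_i."""
    hits = sum(
        any(count_matches(s_j, s_i) > threshold for s_i in sigs_i)
        for s_j in sigs_j
    )
    return hits / len(sigs_j)


def top_k_similar_users(user, signatures, k, threshold):
    """Rank the other users by similar-sample ratio and keep the top k."""
    ratios = {
        other: similar_ratio(signatures[user], sigs, threshold)
        for other, sigs in signatures.items()
        if other != user
    }
    return sorted(ratios, key=ratios.get, reverse=True)[:k]


# Toy signatures: each sample is mapped to L = 4 hash values.
signatures = {
    "U1": [(1, 2, 3, 4), (5, 6, 7, 8)],
    "U2": [(1, 2, 3, 9), (5, 6, 0, 0)],
    "U3": [(9, 9, 9, 9), (1, 0, 0, 0)],
    "U4": [(1, 2, 3, 4), (5, 6, 7, 8), (0, 0, 0, 0)],
}
print(top_k_similar_users("U1", signatures, k=2, threshold=2))
```

Ranking users and keeping the top k avoids having to fix an absolute cut-off on the ratio, which would be sensitive to the random draw of the hash functions.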
4.2. Personalized training stage
Through similarity calculation, each user gets his own set of similar users. To make the model generated by the federated learning suitable for the user's local data, the user only trains the federated learning model with similar users. The generated personalized model can combine the generalization characteristics of the global model and the data matching characteristics of the local model. In this subsection, we will introduce our personalized methods considered from the data level and model level: similar users training and incremental selection methods.
4.2.1. Similar users training
Algorithms 2 and 3 describe the cooperative training process of the master node and similar users. The key steps of building a tree are as follows.
In the process of training a random tree, the master is responsible for coordinating the cooperating users and, according to the total splitting information, determines the splitting attribute of a node or chooses to stop splitting. We call the function RANDOMPICK provided by Liu et al. [12], which allows the master to exchange information with cooperating users and return the candidate attributes F′ and splitting values {vk,k=1,...,|F′|} to them. The cooperative user Uj temporarily splits the local data into left and right parts according to each value in {vk,k=1,...,|F′|}. We use the information gain (IG) quality function to evaluate the scores of nodes divided by different attributes and corresponding values.
Each user finds the attribute with the largest information gain in the candidate set F′ as the local best attribute. To protect user privacy, we use the exponential mechanism to select the local best attribute fjk; the details are described in Section 4.3. The master also obtains its own perturbed local optimal attribute fik, and receives the local optimal splitting attributes sent by the cooperating users. Then, he determines the global optimal splitting attribute f′ by weighted voting: the larger the sample size N of a user, the greater his weight in the federated learning. The attribute f′ with the largest weight is selected as the best split for the current node. Then the master sends the determined node split to the cooperating users.
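The weighted vote over local best attributes can be sketched as below. This is a minimal illustration that assumes each user's weight is simply his sample count N; the function name weighted_vote and the toy ballots are hypothetical, and in PP-FPRF the reported attributes would already have been perturbed by the exponential mechanism.

```python
from collections import defaultdict


def weighted_vote(local_choices):
    """Pick the attribute with the largest total sample-count weight.

    local_choices: list of (attribute, sample_count) pairs, one per user.
    """
    weight = defaultdict(int)
    for attribute, n_samples in local_choices:
        weight[attribute] += n_samples
    return max(weight, key=weight.get)


# Toy ballots: (locally selected attribute, user's sample size N).
ballots = [("f3", 120), ("f1", 300), ("f3", 250), ("f7", 40)]
print(weighted_vote(ballots))
```

Here f3 wins with a total weight of 370 against 300 for f1, even though the single largest user voted for f1.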
In RANDOMPICK [12], the master Ui first randomly selects a subset F′⊂F as candidate splitting attributes, and then sends F′ to each cooperative user Uj∈Si participating in the training. For each attribute fk∈F′, each cooperative user Uj randomly selects a value vjk within the range between the minimum and maximum local values of attribute fk, and sends vjk to the master. The master Ui combines its local random splitting value vik with the splitting values {vjk,j=1,...,|Si|} received from other users, and takes the minimum and maximum of these values. It then randomly selects a value vk between this minimum and maximum as the splitting value of the candidate attribute fk. The master calculates the splitting values of all candidate attributes in F′ in this way, and sends these values {vk,k=1,...,|F′|} to the cooperative users.
This step is applied recursively to split the random tree until the stopping condition is met. Before creating a new tree node, the master checks whether the stopping condition is met. To prevent the generated decision tree from overfitting, the stopping condition we adopt limits the maximum depth of the tree and the number of remaining samples in the node [12,33]. When a node reaches the stopping condition, the cooperating users perturb the class counts in the node by the Laplace mechanism, as described in Section 4.3. They send the perturbed counts to the master node, which calculates the global class counts and takes the class with the largest total count as the label of the leaf node. In summary, the training of the model is completed by aggregating the random attribute values corresponding to the candidate attributes, the best attribute splits selected by the exponential mechanism, and the perturbed counts of leaf labels.
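The two-stage random choice of a splitting value in RANDOMPICK can be sketched for a single attribute fk. The code is an illustrative reading of the description above, not the implementation of [12]; the function names, the user labels and the toy attribute values are hypothetical, and uniform sampling is assumed for both random draws.

```python
import random


def local_proposal(local_values, rng):
    """A user's random value v_jk within the local [min, max] of attribute f_k."""
    return rng.uniform(min(local_values), max(local_values))


def master_split_value(own_proposal, received_proposals, rng):
    """The master's split value v_k, drawn between the extreme proposals."""
    candidates = [own_proposal] + list(received_proposals)
    return rng.uniform(min(candidates), max(candidates))


rng = random.Random(7)
# Toy local values of attribute f_k held by the master U_i and two collaborators.
users = {"U_i": [0.2, 1.4, 0.9], "U_1": [0.5, 2.0], "U_2": [-0.3, 0.1, 0.4]}
proposals = {name: local_proposal(vals, rng) for name, vals in users.items()}
v_k = master_split_value(
    proposals["U_i"], [proposals["U_1"], proposals["U_2"]], rng
)
print(proposals, v_k)
```

Only one random value per attribute leaves each user, so the master never sees the users' actual minimum and maximum attribute values.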
4.2.2. Incremental selection
This is a personalization approach that we consider at the model level. Ensemble pruning selects a subset of the ensemble model to form a new ensemble model. Zhou et al. [34] argued that a smaller ensemble obtained by pruning can perform better than the original ensemble. Most of the existing ensemble pruning methods either select base classifiers or adjust their weights [35]. We adopt the selection-based method for ensemble pruning.
For the random decision trees generated in the similar users training stage described above, users apply the INCRE_SELECT method to adapt the model locally. A user evaluates the importance of a newly generated decision tree to the current ensemble model by testing the performance of the model on the validation set. If adding a new random decision tree t to the user's local personalized model Mp makes the accuracy of the model Mp+t on the user's validation set higher than before, the user adds t to the current ensemble model. This local selection of newly generated random decision trees plays the same role as the local fine-tuning of a federated neural network in previous personalization studies. By using incremental selection, users can not only control the number of trees in the model and reduce the cost of storage and computation during prediction, but also further personalize the model and improve its local performance.
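The incremental selection rule can be sketched as follows. This is a minimal illustration rather than the paper's INCRE_SELECT: trees are modeled as plain callables, the forest predicts by majority vote, an empty forest is assumed to have accuracy 0, and the toy trees and validation set are hypothetical.

```python
from collections import Counter


def forest_predict(forest, x):
    """Majority vote of the trees; each tree is a callable x -> label."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


def accuracy(forest, validation):
    """Validation accuracy of the forest; an empty forest scores 0."""
    if not forest:
        return 0.0
    correct = sum(forest_predict(forest, x) == y for x, y in validation)
    return correct / len(validation)


def incre_select(forest, new_tree, validation):
    """Keep new_tree only if it strictly improves validation accuracy."""
    if accuracy(forest + [new_tree], validation) > accuracy(forest, validation):
        forest.append(new_tree)
        return True
    return False


# Toy validation set and two toy "trees".
validation = [(0, "sit"), (1, "walk"), (2, "walk"), (3, "sit")]
good_tree = lambda x: "walk" if x in (1, 2) else "sit"
constant_tree = lambda x: "walk"

personal_forest = []
print(incre_select(personal_forest, good_tree, validation))      # tree kept
print(incre_select(personal_forest, constant_tree, validation))  # tree rejected
```

Because a tree is kept only when accuracy strictly increases, the personalized forest stays small once additional trees stop helping on the user's own data.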
4.3. Privacy protection
The works in [36,37] studied how differential privacy interacts with each component of the decision tree algorithm and the conflict that arises when trying to balance the need for privacy against the accuracy of the model. Patil and Singh [38] introduced the concept of differential privacy into the classic random forest algorithm. On the basis of these studies, we analyze the privacy issues in our federated personalized random forest model, and use differential privacy to protect users' privacy during the model training process.
The calculation of information gain and the class counts are directly based on the user's data. From the perspective of differential privacy, publishing such information may leak privacy, and these potential leaks are exactly what a differential privacy algorithm aims to prevent [37]. Next, we elaborate on user privacy protection in the two key steps of model construction: using the exponential mechanism to perturb the local optimal attribute in non-leaf nodes, and adding Laplace noise to the class counts in the leaf nodes.
The sensitivity of the information gain quality function q is Δq=log2|C| (logarithm to base 2), where |C| is the domain size of the class attribute C [38]. We allocate a privacy budget ϵ1=ϵ/d for the perturbation of local optimal candidate attributes on non-leaf nodes, where ϵ is the privacy budget allocated to a tree of the user, and d is the depth of the tree (including non-leaf nodes and leaf nodes). The exponential mechanism selects a local optimal candidate attribute fk with the following probability:
$$\Pr[f_k] = \frac{\exp\left(\frac{\epsilon_1\, q(D, f_k)}{2\Delta q}\right)}{\sum_{f \in F'} \exp\left(\frac{\epsilon_1\, q(D, f)}{2\Delta q}\right)}$$
When determining the leaf label, the class counts in the user's leaf node are required. The sensitivity of the class counts is 1. We allocate a privacy budget ϵ2=ϵ/d for the leaf node class counts:
$$n'_c = n_c + \mathrm{Lap}\left(\frac{1}{\epsilon_2}\right)$$
where nc is the number of samples with class label c in the node and n'c is the perturbed count; the user adds Laplace noise to the count of each class.
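The leaf labeling step, in which each user perturbs his class counts with Lap(1/ϵ2) and the master sums the perturbed counts, can be sketched as below. The code is an assumption-laden illustration: the helper names, the toy counts and the chosen values ϵ=1 and d=15 are hypothetical, and the Laplace draw is implemented as a difference of two exponential variables.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    """Draw from Lap(scale) as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def perturb_counts(class_counts, epsilon2, rng):
    """User side: add Lap(1 / epsilon2) noise to every class count."""
    return {
        label: count + laplace_noise(1.0 / epsilon2, rng)
        for label, count in class_counts.items()
    }


def leaf_label(noisy_counts_per_user):
    """Master side: sum the perturbed counts and take the largest class."""
    total = defaultdict(float)
    for counts in noisy_counts_per_user:
        for label, count in counts.items():
            total[label] += count
    return max(total, key=total.get)


rng = random.Random(5)
epsilon2 = 1.0 / 15.0   # epsilon / d with epsilon = 1 and d = 15 (assumed values)
user_counts = [{"walk": 40, "sit": 3}, {"walk": 25, "sit": 10}]
noisy = [perturb_counts(counts, epsilon2, rng) for counts in user_counts]
print(leaf_label(noisy))
```

With such a small per-leaf budget the noise scale is 15, so leaves that hold only a few samples can easily receive a label that differs from the true majority class.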
Given a privacy budget ϵ for a private random decision tree, we show that the tree building process preserves ϵ-differential privacy. For the perturbation of local best candidate attributes on non-leaf nodes, the allocated privacy budget is ϵ1=ϵ/d. Since the nodes in the same layer hold disjoint samples, by parallel composition (Theorem 4) the privacy budget consumed by each non-leaf layer of the tree is still ϵ1. By sequential composition (Theorem 3), the total privacy budget consumed by the d−1 layers of non-leaf nodes is ϵin=ϵ1∗(d−1)=ϵ∗(d−1)/d. For the count of each class in the user's leaf nodes, the allocated privacy budget is ϵ2=ϵ/d. Because the class counts are computed on disjoint samples, the overall privacy budget consumed on the user's leaf nodes is ϵl=ϵ2=ϵ/d. The total privacy budget for the user to select split attributes and leaf labels in a tree is ϵin + ϵl = ϵ. In conclusion, ϵ-differential privacy is provided for each tree of the user. Since all trees are built on the user's training set, the privacy budget accumulates over the T trees, and the privacy budget consumed by the user to participate in federated learning is B=T∗ϵ.
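As a worked instance of this accounting, take the settings used later in Section 5.2.1 (ϵ=1 per tree and T=20 trees) and assume, purely for illustration, that the maximum tree depth of 15 plays the role of d:

$$
\epsilon_1 = \epsilon_2 = \frac{\epsilon}{d} = \frac{1}{15}, \qquad
\epsilon_{in} = \epsilon_1 (d-1) = \frac{14}{15}, \qquad
\epsilon_l = \epsilon_2 = \frac{1}{15},
$$

$$
\epsilon_{in} + \epsilon_l = \frac{14}{15} + \frac{1}{15} = 1 = \epsilon, \qquad
B = T \epsilon = 20 \cdot 1 = 20.
$$

Under these assumptions each attribute selection and each noisy class count runs with a budget of only 1/15, while a user's total budget over the whole forest is 20.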
5. Experimental results
In this section, we describe in detail our extensive experiments to evaluate the effectiveness of personalized federated random forest. We show the public datasets considered in the experiment and discuss the results obtained on the target datasets.
5.1. Experimental setup
We use the public human activity recognition dataset UCI SmartPhone [39]. This dataset contains 6 activities of 30 users. These 6 activities are walking, going upstairs, going downstairs, sitting, standing, and lying down. The 30 users are between 19 and 48 years old. Each user wears a smartphone (Samsung Galaxy S II) on his waist and uses its built-in accelerometer and gyroscope to collect data generated by the activities. We also consider the well-known WISDM dataset [40], which has been widely adopted as a benchmark for human activity recognition tasks. WISDM contains accelerometer data collected from the smartphone in each subject's pocket during the execution of the activity. The activities included in the dataset are walking, jogging, climbing stairs, brushing teeth, folding clothes and so on. We use the data collected by the mobile phone acceleration sensor in the WISDM dataset. Table 2 shows the detailed information of the datasets used after preprocessing.
We treat each subject in the activity recognition dataset as an independent user in a real federated environment. To simulate the heterogeneity of data distribution (non-IID) among users, before training we randomly put each user's data into one of three states. 1) The user's data is sufficient: the user's original data is retained. 2) The user's data is insufficient: the user's data is randomly sampled. 3) The user's label distribution is unbalanced: a part of the classes is randomly selected. The processed data is used as the actual data in each user's federated training. Then the dataset of each user is randomly divided into three groups: training set, validation set, and test set. Among them, 70% of the user data is selected as training data, 20% is selected as test data, and the remaining data is used as the user's validation data. Through the above settings, we establish a realistic and complex federated learning environment.
To demonstrate the effectiveness of the personalized federated random forest, we compare our privacy-protected federated personalized random forest model (PP-FPRF) with two methods: (1) Local random forest (LRF): the user only trains the random forest model locally. There is no communication between the user and others, so the random forest model is trained only on the user's own local data, that is, it is the local model of the user. Since users only train locally, they do not need to worry about privacy issues. (2) Privacy-protected federated global random forest (PP-FGRF) [12]: all users train a global federated learning model together, without personalization. Differential privacy is used to protect the intermediate data in the training process, and the method of adding noise is the same as that of PP-FPRF.
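The per-user split into training, validation and test data can be sketched as follows. This is only an illustration of the 70%/20%/remainder proportions stated above; the function name split_user_data, the use of a seeded shuffle and the toy sample list are assumptions, not details taken from the experiments.

```python
import random


def split_user_data(samples, rng):
    """Randomly split one user's samples: 70% train, 20% test, rest validation."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, validation, test


rng = random.Random(0)
samples = list(range(100))   # stand-in for one user's 100 feature windows
train, validation, test = split_user_data(samples, rng)
print(len(train), len(validation), len(test))
```

The validation part is the data on which each user later evaluates candidate trees during incremental selection, so it has to stay disjoint from both the training and the test data.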
5.2. Experimental results
We test the performance of the personalized random forest model on the activity recognition datasets, and analyze the impact of the tree settings and the privacy budget on the model.
5.2.1. Classification accuracy
We fix the depth of each tree in LRF (local model), PP-FGRF (global federated model), and PP-FPRF (personalized federated model) to 15, and fix the number of trees in the user random forest model to 20. We assign the same privacy budget ϵ = 1 to each tree of the user in the global federated model and the personalized federated model. The hyperparameter k in the personalized random forest is set to 7, that is, the number of similar users for each user. k is varied from 1 to 11, the experimental results in Figure 2 shows that the performance of the model is better when k reaches 7. It can not only improve the generalization ability of users, but also maintain the personalized characteristics of the models. We train the three models on the SmartPhone and WISDM datasets, and the experimental results are shown in the Tables 3 and 4, where A, B, and C represent three types of users with sufficient data, insufficient data, and unbalanced label distribution, respectively. We can see the average accuracy of different models on these three types of users, as well as the overall accuracy of all users participating in the training.
Across both datasets, the average accuracy of the federated random forest over all users is higher than that of the random forests trained independently by the users, and our personalization method further improves the federated random forest. So far we have focused on the average accuracy of the models over the users participating in the training. In addition, we measured the accuracy for individual users and examined whether participating in federated learning improved their models.
We analyze the detailed results of each user in the SmartPhone dataset, as shown in Table 5. For most users, the performance (i.e., accuracy) of the federated personalized random forest is better than that of the other methods. Clients 1 to 10 are type A users, whose data is relatively sufficient. Clients 11 to 20 are type B users, whose data volume is small. Clients 21 to 30 are type C users, whose label distribution is unbalanced. For type B and type C users, both PP-FGRF and our PP-FPRF improve substantially on the local LRF models, which do not participate in federated training. Although the global federated random forest PP-FGRF improves generalization over the users as a whole, it does not bring an effective improvement for type A users, who already have sufficient data. This removes their motivation to participate in federated training. For them, additional data from irrelevant users acts as noise and degrades the model. Therefore, we use personalized training based on similar users and incremental selection to better adapt to each user's personal data and effectively alleviate the conflict described above.
5.2.2. Effect of tree settings
We separately tested the influence of the number of trees and of the maximum tree depth. When experimenting with the number of trees, we fix the remaining hyperparameters and vary only the number of trees. As Figure 3 shows, the accuracy of all three methods improves greatly when moving from a single tree to multiple trees, reflecting the advantage of the forest structure. However, once the number of trees reaches a certain value, adding more trees has little effect on the results. This shows that an incremental selection method is needed to control the number of trees in the model, which helps reduce the model's storage and computational overhead.
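The plateau described above can be located with a simple rule: pick the smallest number of trees whose accuracy is within a tolerance of the best observed accuracy. The following is a minimal sketch of that rule, not the paper's incremental selection procedure; the function name `smallest_sufficient_trees`, the tolerance, and the accuracy curve are hypothetical.

```python
def smallest_sufficient_trees(acc_by_n, tol=0.005):
    """Return the smallest tree count whose accuracy is within `tol`
    of the best accuracy in the curve.

    acc_by_n: dict number_of_trees -> accuracy (hypothetical values)
    """
    best = max(acc_by_n.values())
    for n in sorted(acc_by_n):
        if best - acc_by_n[n] <= tol:
            return n

# Hypothetical accuracy curve, not results from the paper.
curve = {1: 0.70, 5: 0.85, 10: 0.90, 20: 0.912, 40: 0.915}
chosen = smallest_sufficient_trees(curve)
print(chosen)
```

With this hypothetical curve, doubling the forest from 20 to 40 trees gains less than the tolerance, so the smaller forest is preferred for its lower storage and computation cost.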
When testing the influence of the maximum tree depth, we fix the remaining hyperparameters and vary only the maximum tree depth. As Figure 4 shows, the maximum tree depth has a greater impact on the results, and the accuracy of the model increases as the depth threshold increases. The locally trained model converges at a small maximum depth, and our personalized model converges at a smaller maximum depth than the global model. When global federated learning trains a tree, more users participate in the training, so there is more data overall; the range of feature values is larger, and deeper nodes are needed to split the data. Therefore, with similarity-based personalized learning, each user trains only with similar users rather than with all users, which further reduces the complexity of the model.
5.2.3. Effect of ϵ
We observe the change in accuracy as the privacy budget ϵ varies from 0.01 to 1.5. Because the local model adds no noise, it is not affected by the privacy budget. The results are summarized in Figure 5. The accuracy of the federated models increases with the privacy budget. When the privacy budget is small, the accuracy of the models obtained by federated learning is worse than that of the local model. Therefore, a balance must be struck between privacy and model utility.
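The trade-off can be illustrated with the Laplace mechanism, a standard way to achieve ϵ-differential privacy for counting queries. The sketch below assumes Laplace noise with scale sensitivity/ϵ added to a count; the noise mechanism actually used by PP-FPRF is defined earlier in the paper and may differ. The function names, the count of 100, and the sensitivity of 1 are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to a count."""
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(epsilon, trials=2000, seed=0):
    """Average absolute error of a noisy count of 100 (hypothetical)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(100, epsilon, rng=rng) - 100)
    return total / trials

# The expected absolute error equals the noise scale, 1 / epsilon here,
# so a smaller privacy budget means noisier intermediate statistics.
for eps in (0.01, 0.1, 1.0, 1.5):
    print(eps, round(mean_abs_error(eps), 2))
```

Under this assumption, the noise at ϵ = 0.01 is on the order of the counts themselves, which is consistent with federated models falling below the noise-free local model at small budgets.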
6.
Conclusions
In this paper, building on the existing federated random forest model, we propose a personalized federated learning framework. We focus on improving the model accuracy of each user through personalization, so that the federated random forest model is better suited to the human activity recognition task. Personalization is considered at two levels: user data and the model. First, locality sensitive hashing functions are used to collect similarity information without exposing individual data records, and users conduct personalized training by selecting similar users. Then, in combination with ensemble pruning, the generated random trees are selected in a personalized manner using the incremental method. At the same time, differential privacy is used in the training phase to protect the private information of users. The experiments show that PP-FPRF improves the classification accuracy of users in activity recognition tasks, and that the personalization method also reduces the complexity of the federated tree model. Introducing personalization into the federated random forest model ensures that more users benefit from federated learning and that the model is better suited to real activity recognition tasks. However, using differential privacy to protect user data also reduces model accuracy, so in practical applications the balance between user privacy and model utility must be considered. In future work, to ensure that user distributions match, we will have users with similar distributions collaboratively train the federated model, and we will explore other privacy protection methods to ensure user privacy.
Acknowledgments
This paper was supported by the National Natural Science Foundation of China (Nos. 62162005 and 61763003), the Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security (No. 19-A-02-01), the 2021 National Undergraduate Student Innovation Training Program Project (No. 202110602075), the Guangxi 1000-Plan of Training Middle-aged/Young Teachers in Higher Education Institutions, the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multisource Information Integration and Intelligent Processing.
Conflict of interest
No potential conflict of interest was reported by the authors.