1. Introduction
The proliferation of advanced technologies, including Artificial Intelligence (AI), the Internet of Things (IoT), and cloud computing, has led to the emergence of federated learning (FL) as a promising distributed machine learning approach [1]. In particular, deploying FL at the network edge can improve the efficiency of learning updates and responses. However, FL at the edge faces the challenges of costly communication and time-sensitive training. Hence, wireless multi-access channels were introduced in reference [2] as a means to transmit parameters or gradient information, thereby mitigating communication overhead. Furthermore, multi-channel communication in intricate wireless networks can conserve resources, provided that the communicated information is effectively distributed across the channels [3]. A natural approach, therefore, is to incorporate a gradient compression mechanism into FL to enhance communication efficiency. Gradient compression strategies typically encompass sparsification and quantization. Sparsification compresses the local model before transmission to the parameter server (PS), significantly reducing communication costs [4]; however, it may slow convergence. Quantization reduces the amount of data in a given communication round before the gradients are transmitted to the PS, potentially expediting convergence [5].
Although FL addresses certain challenges related to safeguarding data privacy, it remains vulnerable in terms of security [6,7,8]. A primary cause is the insufficient protection of the model parameters in FL. When servers and clients are semi-honest, sensitive information can be inferred from the exchanged parameters, leading to privacy breaches. In this context, current approaches integrate FL with privacy-preserving technologies, including differential privacy (DP) [9,10,11], homomorphic encryption [12], secure multi-party computation [13] and Blockchain [14], among others. DP stands out among these methods due to its rigorous mathematical formulation and modest computational requirements [15].
Additionally, DP can precisely quantify data privacy and privacy loss [16,17]. A prominent line of work in DP explores alternative divergences for developing variants of differential privacy. In this regard, Rényi differential privacy (RDP) has been shown to yield robust privacy guarantees for sequences of random mechanisms that access device datasets [18,19,20]. Applying DP can enhance local data privacy; however, combining encryption procedures with DP may increase computational overhead and decrease training accuracy, as stated in reference [21]. Therefore, optimizing FL's performance by minimizing communication overhead and computing costs while simultaneously upholding data privacy and security emerges as a pivotal concern.
While DP can effectively safeguard private data by introducing noise, it also brings potential challenges, such as increased computing overhead and a potential decrease in training accuracy [21]. Motivated by these challenges, we present a novel framework, SS-RDP-WTS. The proposed scheme utilizes multi-access channels within a wireless communication network to ensure data privacy protection and efficient communication in FL. Gradient compression is implemented through a multi-level stochastic quantization approach within an FL framework over wireless multiple access channels; combined with subsampling, this effectively reduces communication and processing costs. Furthermore, exploiting the favorable composition properties of RDP, we strengthen the privacy protection capability of the scheme. Model parameters are safeguarded with discrete Gaussian noise during client updates, and clients' privacy is further ensured by leveraging the privacy amplification effect of uniform subsampling. Our paper provides a theoretical formulation of a comprehensive distributed mean estimation problem. The empirical findings demonstrate that the subsampling private wireless transmission technique efficiently balances model utility and communication while offering enhanced privacy protection.
The remainder of this paper is organized as follows. Section 2 describes the relevant prerequisites, including basic definitions and theorems related to DP. Section 3 introduces the proposed system model with a subsampling private wireless transmission scheme. Section 4 analyzes the privacy guarantee and convergence rate of the system model, Section 5 evaluates its performance, and Section 6 presents the conclusions.
2. Background
In this section, we present the system model and introduce RDP and its related definitions. The commonly used notations are listed in Table 1.
2.1. DP in wireless FL
Wireless FL is a distributed intelligent computing paradigm that aims to protect clients' data privacy while effectively utilizing the decentralized data resources and computing power of large-scale devices through local model training on edge devices. In a wireless communication network, where data change rapidly and edge devices are heterogeneous, wireless FL can not only update the model in real time but also improve the learning effect. Moreover, it enables FL to train on a broader range of data with more powerful computing resources, improving model accuracy and performance.
One potential approach to addressing the FL objective while safeguarding client privacy involves training the model using a modified and distorted representation of the data, which does not expose the actual data directly during training. To address this privacy issue, Du et al. [22] implemented a machine learning strategy for smart edges using DP. Seif et al. [23] studied FL over a wireless channel, modeled by a Gaussian multiple access channel (MAC) and subject to local differential privacy constraints. Liu et al. [24] demonstrated that as long as the privacy constraint level, measured via DP, is below a threshold that decreases with the signal-to-noise ratio, uncoded transmission achieves privacy "for free", i.e., without affecting the learning performance. During the learning process, model updates using local private samples and large-scale parameter exchanges among agents raise severe privacy concerns and create a communication bottleneck. To address these problems, Ding et al. [25] proposed two differentially private and communication-efficient algorithms. Along similar lines, Wei et al. [26] minimized the FL training delay over wireless channels, constrained by overall training performance as well as each client's DP requirement.
2.2. Privacy definitions
It is known that DP has put private data analysis on a firm theoretical foundation [27,28]. We present the definitions of DP below.
Definition 1. (ε-DP [29]). For ε>0, a randomized algorithm M is said to be ε-differentially private if, and only if, for any two adjacent sets $D_1,D_2$ and any $S\subseteq\mathrm{Range}(M)$, it holds that

$$\Pr[M(D_1)\in S]\le e^{\varepsilon}\,\Pr[M(D_2)\in S],$$

where ε is the privacy budget.
To attain privacy, it is imperative that the algorithm M contains a randomized component. A smaller privacy budget implies that the adversary has less capacity to discern, from the output, whether any particular record is present.
Definition 2. ((ε,δ)-DP [9]). For ε,δ>0, a randomized algorithm M is said to be (ε,δ)-differentially private if, and only if, for any two adjacent sets $D_1,D_2$ and any $S\subseteq\mathrm{Range}(M)$, it satisfies that

$$\Pr[M(D_1)\in S]\le e^{\varepsilon}\,\Pr[M(D_2)\in S]+\delta.$$
Definition 3. (RDP [19]). For ε>0 and α>1, a randomized mechanism M satisfies (α,ε)-RDP if, and only if, for any two adjacent sets $D_1,D_2\in S$, it holds that

$$D_{\alpha}\bigl(M(D_1)\,\|\,M(D_2)\bigr)=\frac{1}{\alpha-1}\log\,\mathbb{E}_{x\sim M(D_2)}\!\left[\left(\frac{\Pr[M(D_1)=x]}{\Pr[M(D_2)=x]}\right)^{\alpha}\right]\le\varepsilon,$$

where $D_{\alpha}(\cdot\,\|\,\cdot)$ denotes the Rényi divergence of order α.
It is evident that RDP is strictly stronger than (ε,δ)-DP for δ>0, and it enables more precise bounds for the composition of Gaussian mechanisms. The following result is used to convert an RDP guarantee into a standard DP guarantee.
Lemma 1. (From RDP to DP [30]). Suppose that for any α>1, a mechanism M is (α,ε)-RDP. Then, the mechanism M is $(\varepsilon+\log(1/\delta)/(\alpha-1),\,\delta)$-DP for all 0<δ<1.
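In practice, a mechanism is often analyzed at several Rényi orders at once, and Lemma 1 is applied at the order that minimizes the resulting ε. The following minimal sketch illustrates this conversion; the Gaussian RDP curve used in the example follows Lemma 2 below, and all function names are illustrative rather than part of our scheme.

```python
import numpy as np

def rdp_to_dp(rdp_eps, alphas, delta):
    """Apply Lemma 1 at every order alpha and keep the tightest
    (eps, delta)-DP guarantee."""
    alphas = np.asarray(alphas, dtype=float)
    rdp_eps = np.asarray(rdp_eps, dtype=float)
    dp_eps = rdp_eps + np.log(1.0 / delta) / (alphas - 1.0)
    best = int(np.argmin(dp_eps))
    return dp_eps[best], alphas[best]

# Example: a Gaussian mechanism with sensitivity 1 and scale sigma
# satisfies (alpha, alpha / (2 sigma^2))-RDP (cf. Lemma 2).
sigma = 2.0
alphas = np.arange(2, 129)
rdp_curve = alphas / (2.0 * sigma ** 2)
eps, alpha_star = rdp_to_dp(rdp_curve, alphas, delta=1e-5)
print(f"({eps:.3f}, 1e-5)-DP, achieved at alpha = {alpha_star:.0f}")
```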
2.3. The advantages of RDP over DP
DP employs the concept of sensitivity to quantify the level of data privacy exposure. Sensitivity refers to the utmost extent to which the outcome of a database query can be influenced by the alteration of a single individual's data. RDP, in turn, offers enhanced flexibility and a wider range of choices through the incorporation of the order parameter α.
To begin, RDP enables the selection of a suitable order α based on the application scenario, improving the adaptability of privacy protection to a range of data processing requirements. In the limit α→∞, RDP recovers pure ε-DP, while as α→1 it corresponds to a guarantee based on the Kullback-Leibler divergence. Larger values of α impose stronger constraints on rare, high-leakage events and therefore suit instances with higher privacy requirements, whereas smaller values of α permit greater query accuracy under the same budget.
Furthermore, compared to DP, RDP demonstrates superior efficacy in safeguarding privacy under certain attack models. In some scenarios, such as when facing a reconstruction attack or when numerous queries are allowed, DP may not offer adequate privacy protection. By strategically selecting suitable values of α, RDP achieves enhanced privacy protection performance in such settings.
Finally, RDP exhibits superior composability. This property enables the integration of diverse privacy mechanisms to enhance privacy protection while still upholding privacy performance. Such composability is of significant importance for advanced data analysis and the secure exchange of data.
2.4. Discrete Gaussian mechanism via subsampling
We state the definitions of the discrete Gaussian mechanism and establish relevant privacy guarantees. We first introduce the discrete Gaussian distribution.
Definition 4. (Discrete Gaussian distribution [31]). Let $\sigma\in\mathbb{R}$ with σ>0. The discrete Gaussian distribution is a probability distribution on a discrete additive subgroup $\mathbb{L}$. The discrete Gaussian distribution with scale σ is denoted as $N_{\mathbb{L}}(\sigma)$; for $x\in\mathbb{L}$, the probability mass on x is proportional to $e^{-x^2/2\sigma^2}$.
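Exact samplers for this distribution exist (see [31]); the following simple sketch instead normalizes the mass over a wide truncated support, which suffices for illustration on $\mathbb{L}=\mathbb{Z}$. The truncation width and function name are our own choices, not part of the scheme.

```python
import numpy as np

def sample_discrete_gaussian(sigma, size, tail=12, rng=None):
    """Sample from the discrete Gaussian on the integers Z with scale
    sigma by normalizing exp(-x^2 / (2 sigma^2)) over a truncated
    support; the mass beyond +/- tail*sigma is negligible."""
    rng = rng or np.random.default_rng()
    bound = int(np.ceil(tail * sigma))
    support = np.arange(-bound, bound + 1)
    logits = -support.astype(float) ** 2 / (2.0 * sigma ** 2)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(support, size=size, p=probs)

print(sample_discrete_gaussian(sigma=4.0, size=5))  # e.g., [ 3 -1  0  6 -4]
```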
To determine the magnitude of the noise to be introduced, it is necessary to bound the ℓ2-sensitivity of the gradient aggregation. Within the context of DP, the amount of noise to be added is calibrated to the sensitivity of the function, defined as follows.
Definition 5. (ℓ2-sensitivity [32]). Suppose that $f:\mathcal{D}\to\mathbb{R}$ is a function and $D_1$ and $D_2$ are two adjacent datasets. The ℓ2-sensitivity of f is defined as $\Delta f\triangleq\max_{D_1,D_2\in\mathcal{D}}\|f(D_1)-f(D_2)\|_2$.
Lemma 2. (RDP for the discrete Gaussian mechanism [18]). A mechanism that perturbs the output of an algorithm with $\mathrm{range}(f)\subseteq\mathbb{L}$ by adding discrete Gaussian noise, i.e., $f(\cdot)+N_{\mathbb{L}}(\sigma)$, satisfies $(\alpha,\varepsilon')$-RDP with $\varepsilon'=\alpha(\Delta f)^2/2\sigma^2$.
The discrete Gaussian mechanism is realized by adding noise with a discrete Gaussian distribution to the output of a function evaluated on a sensitive dataset. Canonne et al. (2020) demonstrated concentrated DP for the discrete Gaussian mechanism and provided a thorough analysis of its privacy and utility properties [30]. For FL, Wang et al. (2021) proposed a discrete Gaussian-based RDP scheme built on an analytical moments accountant [20].
It is also important to exploit the randomness introduced by subsampling. Wang et al. (2018) remarked that the discrete Gaussian mechanism exhibits a tight privacy amplification bound under subsampling and indicated that RDP enables the mechanism to be composed tightly with an analytical moments accountant, which saves privacy budget in multi-round FL [33].
3. Our proposed SS-RDP-WTS scheme
This section presents our proposed SS-RDP-WTS, as shown in Figure 1.
The proposed system performs an iterative process in which the following steps are repeated (a minimal sketch of one round is given after this list):
1) The PS broadcasts the initialized model to a randomly selected subset of local clients. Each client then performs local stochastic gradient descent on its local data to obtain an updated local gradient.
2) To tackle the communication bottleneck, the local gradient of each client selected through subsampling is compressed using multi-level stochastic quantization.
3) To further enhance the algorithm's privacy protection capability, discrete Gaussian noise is added to the quantized, discretized gradient.
4) The noisy model parameters are transmitted over the wireless Gaussian multiple access channel (GMAC) to improve the communication efficiency of the model.
5) The PS averages the uploaded model parameters to obtain a new global model.
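The following sketch illustrates one such round under simplifying assumptions. The helpers local_gradient, stochastic_quantize, and sample_discrete_gaussian (the latter two sketched elsewhere in this paper) and all parameter names are illustrative, not the authors' implementation.

```python
import numpy as np

def ss_rdp_wts_round(w, clients, p, levels, g_max, sigma, n0, rng):
    """One illustrative SS-RDP-WTS round: subsample clients, quantize
    and privatize local gradients, superpose them over the GMAC, and
    average at the PS."""
    d = w.shape[0]
    y = np.zeros(d)
    for c in clients:
        if rng.random() >= p:                        # 1) uniform subsampling
            continue
        g = local_gradient(w, c)                     # local SGD gradient (assumed helper)
        q = stochastic_quantize(g, levels, g_max)    # 2) multi-level compression
        x = q + sample_discrete_gaussian(sigma, d)   # 3) discrete Gaussian noise
        y += x                                       # 4) superposition on the GMAC
    y += rng.normal(0.0, np.sqrt(n0), size=d)        # additive channel noise
    return y / (len(clients) * p)                    # 5) PS averaging (scale N*P)
```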
SS-RDP-WTS: We study an FL system with a central server, N clients, and a wireless communication technique. The clients communicate with the PS through the wireless GMAC. The pseudocode of this algorithm is shown in Algorithm 1.
Each client k has a local dataset $D_k$ of size $|D_k|$, consisting of data points $u_i^{(k)}$ with associated labels $v_i^{(k)}$, so that $D_k=\{(u_i^{(k)},v_i^{(k)})\}_{i=1}^{|D_k|}$. The client communicates with the PS through the GMAC to train the model. This training process involves minimizing the loss function ℓ(w)
where $w\in\mathbb{R}^d$ is the parameter vector, $\ell_k(\cdot)$ is the loss function of client k, and $D_{\mathrm{total}}=\bigcup_{k=1}^{K_t}D_k$ is the total dataset participating in training. ℓ(w) is minimized iteratively by the stochastic gradient descent (SGD) algorithm. In training iteration t, the PS broadcasts the global parameter vector $w_t$ to the clients. Each client k calculates its local gradient on its local dataset
Then, the clients stochastically quantize the gradient values into a discrete domain and apply the discrete Gaussian mechanism to ensure privacy. During quantization, the quantization parameters are optimized with respect to the capacity of the wireless GMAC. To streamline the presentation, we omit the iteration index t where possible. In each iteration, each client quantizes its local vector $g_k^t$ into $b_k^t$ discrete levels, ensuring that the quantized value lies within the range $[-g_k^{\max},g_k^{\max}]$. Let $g_k^{\max}$ be the clipping bound L. For every integer φ in the range $[0,b_k^t)$, the quantization knots are

$$B_k(\varphi)=-g_k^{\max}+\varphi\,\frac{s_k}{b_k^t-1},$$
where $b_k^t\ge2$ represents the quantization level for client k, $s_k$ is usually chosen as $2g_k^{\max}$, and $B_k(\varphi)$ is the knot with index φ. If $g_{k,j}\in[B_k(\varphi),B_k(\varphi+1))$ holds for client k, the quantized value is

$$q_{k,j}=\begin{cases}B_k(\varphi+1) & \text{with probability } \dfrac{g_{k,j}-B_k(\varphi)}{B_k(\varphi+1)-B_k(\varphi)},\\[2mm] B_k(\varphi) & \text{otherwise.}\end{cases}$$
So, $q_{k,j}$ is an unbiased estimator of $g_{k,j}$, meaning that $\mathbb{E}[q_{k,j}^t]=g_{k,j}^t$. Additionally, the variance can be bounded as

$$\mathrm{Var}[q_{k,j}^t]\le\frac{L^2}{(b_k^t-1)^2}.$$
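A minimal sketch of this multi-level stochastic quantizer is given below, assuming the knots above with $s_k=2g_k^{\max}$; the function name and the clipping step are our own additions for illustration.

```python
import numpy as np

def stochastic_quantize(g, levels, g_max, rng=None):
    """Quantize each coordinate of g (clipped to [-g_max, g_max]) onto
    `levels` evenly spaced knots, rounding up with probability equal to
    the position inside the bin, so that E[q] = g coordinate-wise."""
    rng = rng or np.random.default_rng()
    g = np.clip(g, -g_max, g_max)
    step = 2.0 * g_max / (levels - 1)        # bin width s_k / (b_k - 1)
    lower = np.floor((g + g_max) / step)     # index phi of the lower knot
    frac = (g + g_max) / step - lower        # position inside the bin
    up = rng.random(np.shape(g)) < frac      # stochastic rounding
    return -g_max + (lower + up) * step

g = np.array([0.31, -0.77, 0.05])
print(stochastic_quantize(g, levels=64, g_max=1.0))
```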
After the complete gradient vector has been quantized, noise must be added. The continuous Gaussian mechanism is not suited to transferring quantized local gradients; the alternative adopted here for ensuring privacy is the incorporation of discrete Gaussian noise. If the privatized local value falls outside the quantization range, preprocessing techniques are employed to map it back into the range.
Subsequently, client k uses a preprocessing function to obtain the code word $x_k^t$, denoted as $x_k^t=f_k^t(q_k^t)$, and transmits it to the PS. In our paper, we examine the random set of participants $K_t$, obtained using uniform subsampling: client k participates in the training process during iteration t with probability P. When $|K_t|=N$, all clients are engaged in the training. The signal generated by client k during iteration t is

$$x_k^t=q_k^t+n_{k,t},$$

where $n_{k,t}\sim N_{\mathbb{L}}(0,\sigma_{k,t}^2 I_d)$.
In addition, when clients transmit $x_k^t$ to the PS through the wireless GMAC, a total of $d\log_2 b$ bits are required to transmit the quantized gradient vector, so the transmission rate of client k is $r_k^t=d\log_2 b_k^t$, which must adhere to the GMAC capacity region [30] given by the inequality
where $C_{K_t}$ is the sum capacity of the wireless GMAC over the subset $K_t$. Suppose that the channel inputs in each iteration adhere to the average power constraint $\|x_k^t\|_2^2\le NP_k,\ \forall k$. Furthermore, $C_{K_t}=\frac{1}{2}\log\bigl(1+\sum_{k=1}^{K_t}p_k/N_0\bigr)$.
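As a rough illustration of how this constraint limits the quantization level, the following sketch computes the largest power-of-two level that fits the sum-capacity budget. The per-iteration channel-use budget s, the equal-rate split and all parameter values are assumptions for illustration, not values fixed by the scheme.

```python
import numpy as np

def max_feasible_levels(d, num_clients, powers, n0, s):
    """Largest power-of-two quantization level b such that the total rate
    num_clients * d * log2(b) fits within s channel uses at sum capacity
    0.5 * log2(1 + sum(p_k) / N0) bits per use (equal split; a sketch)."""
    c_sum = 0.5 * np.log2(1.0 + np.sum(powers) / n0)   # bits per channel use
    bits_per_client = s * c_sum / num_clients           # equal-rate split
    return int(2 ** np.floor(bits_per_client / d))      # d*log2(b) <= budget

d = 44_426  # MNIST model dimension from Section 5
print(max_feasible_levels(d, num_clients=2, powers=[320, 80], n0=1.0, s=5 * d))
```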
The input-output relationship over the GMAC can be expressed as

$$y_t=\sum_{k\in K_t}x_k^t+m_t.$$

Here, $x_k^t\in\mathbb{R}^d$ is the signal transmitted by client k at iteration t and $m_t\sim N_{\mathbb{L}}(0,N_0)$ is the independent and identically distributed (IID) channel noise.
After the PS receives the signal $y_t$, its objective is to recover the average of the true gradients by applying a post-processing function $h_t(\cdot)$. This average can be represented as $\hat g_t=\bigl[\sum_{k=1}^{K_t}y_k\bigr]/\mu_{|K_t|}$. Nevertheless, due to the preprocessing and post-processing operations, as well as the GMAC capacity constraints, the PS can only decode and recover the perturbed local gradients.
4. Theoretical analysis
In this section, we present the primary results of our proposed approach, analyzing the RDP features of the subsampling private wireless transmission system. Specifically, we discuss the privacy guarantee outlined in Theorem 1 and introduce the privacy-convergence trade-off of the scheme, as shown in Theorem 2.
4.1. Privacy analysis under SS-RDP-WTS
The subsequent theorem elucidates the RDP guarantee associated with the discrete Gaussian mechanism within the subsampling privacy model, as outlined in Theorem 1. Initially, a subset of $K_t\le N$ clients is selected from the full set of clients. This subsampling is performed independently and uniformly, with each client having an equal likelihood of being selected; the probability of selection is denoted by $P=K_t/N$. Subsequently, each client k within the chosen subset quantizes its local gradient and then applies the discrete Gaussian mechanism.
Theorem 1. Let the clipping bound be L, the noise scale be σ and the quantization level be $b_k$ with $b_k\ge b_{\min}$. The proposed scheme satisfies $(\alpha,\varepsilon'(\alpha))$-RDP with
Proof. Note that $\dfrac{\sum_{x\in\mathbb{L}}\exp\bigl(-(x-(1-\alpha)\mu)^2/2\sigma^2\bigr)}{\sum_{x\in\mathbb{L}}\exp\bigl(-(x-\mu)^2/2\sigma^2\bigr)}\le1$ and $\mathrm{range}(f)\subseteq\mathbb{L}$. Thus
According to Lemma 2, the discrete Gaussian mechanism satisfies $(\alpha,\alpha(\Delta f)^2/2\sigma^2)$-RDP.
Next, we bound the sensitivity for client k. Note that $y_t=\sum_{k=1}^{K_t}g_k^t+Z_t$, where the variance of the effective Gaussian noise $Z_t$ is $\sigma^2=\sum_{k=1}^{K_t}\sigma_{k,t}^2+N_0$, and that $\Delta f\triangleq\max_{D_1,D_2\in\mathcal{D}}\|f(D_1)-f(D_2)\|_2$ with $\|g_k^t\|_2\le L,\ \forall k$, where L is the Lipschitz parameter. Thus,
According to [20], the upper bound on the quantized ℓ2-sensitivity is given by
Thus, it suffices to show that the composition of quantization and the discrete Gaussian mechanism is $(\alpha,\alpha(\Delta f)^2/2\sigma^2)$-RDP if we have
According to Theorem 9 in [15], we obtain
Our scheme therefore satisfies $(\alpha,\varepsilon'(\alpha))$-RDP. This completes the proof of Theorem 1.
Remark: Unlike prior research, our proposed system offers a more robust privacy guarantee. This is primarily due to the following factors:
1) The proposed scheme adheres to RDP principles, which offer enhanced privacy measures and tighter composition, as shown in [19].
2) As a result of implementing privacy amplification through subsampling and the addition of discrete Gaussian noise, our scheme maintains a reduced noise level.
4.2. The convergence rate on SS-RDP-WTS
Theorem 2. Assume the loss function ℓ(·) is λ-strongly convex and μ-smooth with an L-Lipschitz gradient, and the learning rate is $\eta_t=1/(\lambda t)$. The convergence rate satisfies
Proof. Note that
In iteration t, the PS averages the received $K_t$ gradients:
After post-processing, we get $\tilde g_t=\tilde f(\hat g_t)$. Note that $|K_t|$ is a binomial random variable, and the probability of each client being selected is the subsampling probability $P=K_t/N$, so $\mu_{|K_t|}=NP$ and $\sigma^2_{|K_t|}=NP(1-P)$.
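For completeness, these two moments follow directly from writing $|K_t|$ as a sum of independent Bernoulli indicators:

$$|K_t|=\sum_{k=1}^{N}\mathbb{1}\{\text{client }k\text{ selected}\},\qquad \mathbb{E}[|K_t|]=\sum_{k=1}^{N}P=NP,\qquad \mathrm{Var}[|K_t|]=\sum_{k=1}^{N}P(1-P)=NP(1-P).$$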
Then we obtain the bounds on the second moment of the gradient
where (a) follows from the fact that for any two vectors $m,n\in\mathbb{R}^d$ and $\rho>0$, $2\langle m,n\rangle\le\rho\|m\|_2^2+\rho^{-1}\|n\|_2^2$, and (b) follows because the local gradients, satisfying the Lipschitz condition, are bounded as $\|g_k^t\|_2\le L,\ \forall k$.
According to the variance expression of the gradient estimate and properties of expectation, we have
where (c) follows from $\mathbb{E}[(q_{k,j}^t-g_{k,j}^t)^2]=\mathrm{Var}[q_{k,j}^t]\le L^2/(b_k^t-1)^2$ and $Z_t=\sum_{k\in K_t}n_{k,t}+m_t\sim N_{\mathbb{L}}(0,\sigma_{Z_t}^2 I_d)$ with $n_{k,t}\sim N_{\mathbb{L}}(0,\sigma_{k,t}^2 I_d)$. Since $m_t\sim N_{\mathbb{L}}(0,N_0)$, we have
We complete the proof of Theorem 2.
Remark: The convergence rate of our scheme is obtained by applying the established convergence results for SGD, as in references [34,35]. We subsequently compute the parameters necessary to ensure convergence. Furthermore, we demonstrate that the convergence rate of our methodology can be enhanced by adjusting the quantization level, subsampling probability and noise parameters within the constraints of a given privacy budget and communication budget. For a given noise level, the per-round convergence rate of SS-RDP-WTS can be improved thanks to the tighter structure of subsampled RDP, in accordance with the convergence rate bound. Simultaneously, discrete Gaussian noise offers more robust privacy assurances at equivalent noise levels.
To characterize the convergence of our scheme, the optimal values of the quantization level and noise parameters that maximize the convergence rate in Theorem 2 are the solutions to an optimization problem; however, this problem is an instance of constrained integer nonlinear programming (INLP). To simplify the calculation, we fix the noise parameters. The optimization problem can then be written as
Computation cost analysis of SS-RDP-WTS: Our proposed approach involves calculating the gradient at the clients, which is influenced by factors such as each client's dataset size and the complexity of the model. Typically, computing the gradient on the client side is a reasonably low-cost process, as it only entails iterating over local data and adjusting model parameters. The discretization operation partitions the gradient values into distinct regions, enhancing the privacy preservation of the gradient; it is generally cost-effective because it mainly amounts to mapping gradient values to discrete regions. To safeguard individual privacy, noise is introduced as an additional measure by perturbing the discretized gradient, a commonly employed and cost-effective procedure. Our proposed scheme achieves this objective by directly injecting noise into the gradient.
Communication cost analysis of SS-RDP-WTS: The first component is subsampling, in which each client's participation in training is determined according to the probability P and the client's decision is communicated to the server. The communication overhead of this step is minimal, as it entails transmitting a single binary decision. The second component is gradient transmission, wherein each participating client communicates its discretized, noise-perturbed gradient to the server over the wireless GMAC; the communication overhead here depends on the dimensionality and precision of the gradient. Lastly, gradient reception refers to the process in which the server receives the gradients from all clients and computes the global average gradient. This step entails obtaining gradient data from each client and scales with the number of clients and the size of the gradient data.
5. Performance evaluation
In this section, we evaluate the efficacy of our proposed strategy. We examine two learning tasks using convolutional neural networks trained on the CIFAR10 and MNIST datasets, employing a cross-entropy loss function. The model's dimensionality on CIFAR10 and MNIST is d=62,006 and d=44,426, respectively. For the CIFAR10 dataset, we use IID partitions across 10 clients, whereas for the MNIST experiment we employ non-IID partitions across 10 clients.
Furthermore, a subsampling probability must be specified. In this context, let $\delta=10^{-5}$ for the different subsampling probabilities P; in particular, P=1 indicates that all clients are involved in the training process. Let $\sigma_{k,t}=1,\ \forall t\in[T]$ be the Gaussian noise scale. Each client applies a clipping operation to the local gradient, using an empirically determined Lipschitz constant L=1. The RDP accounting is performed with Google's DP library. In the experiments, we tested the variable α over the range 2 to 64, together with the values 128, 256 and 512, in order to identify the minimum value of ε. The minimal value of the quantization level is $b_{\min}=101$ in Figure 2, while $b_{\min}=64$ in Figures 3 and 4. To facilitate the comparison of different bounds, the RDP bounds are converted into the (ε,δ)-DP framework.
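As an illustration of this accounting step, the following sketch uses Google's open-source dp_accounting package. It approximates the paper's uniform subsampling with Poisson subsampling, and the sampling probability, noise multiplier and round count shown are placeholders rather than the paper's exact settings.

```python
from dp_accounting import dp_event
from dp_accounting import rdp

orders = list(range(2, 65)) + [128, 256, 512]    # tested alpha values
accountant = rdp.RdpAccountant(orders)

# One round: subsampled Gaussian mechanism with probability P and scale sigma/L.
event = dp_event.PoissonSampledDpEvent(
    sampling_probability=0.01,
    event=dp_event.GaussianDpEvent(noise_multiplier=1.0),
)
accountant.compose(event, 1000)                  # compose over T rounds

print(accountant.get_epsilon(target_delta=1e-5))  # minimal eps over orders
```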
Figure 2 shows that the values of ε corresponding to P=1 and P=0.1 are relatively high, indicating a lower degree of privacy protection. Reducing the subsampling probability leads to an improvement in the privacy guarantee. In an actual wireless network environment, a significant number of clients are involved in the training process; hence, it is rational to choose a reduced subsampling probability.
Figure 3 examines the relationship between model accuracy and the noise level while maintaining a constant sampling probability. Here, we select two clients to participate in the training process, with average transmit power limits of $p_1=320$ and $p_2=80$ for the respective datasets.
The number of channel uses in the MAC is set to $s=5d$ per iteration, and the minimum value of the quantization level is $b_{\min}=64$. To maintain the quantized attributes, the privatized local model is truncated whenever it exceeds the truncation threshold $q_{\max}+3\sigma_{k,t}$, where $q_{\max}$ denotes the highest value of the quantized gradient. The findings illustrate that the magnitude of the discrete Gaussian noise has a certain impact on the convergence rate outlined in Theorem 2, even after the model accuracy has converged. At the same time, as the magnitude of the noise decreases, the level of precision increases.
Figure 4 illustrates the model's accuracy under various privacy budgets, with $\sigma_{k,t}=1,\ \forall t\in[T]$ as the Gaussian noise scale. As the subsampling private wireless transmission strategy converges, its accuracy gradually approaches that of the scheme without privacy, given the various privacy budgets and the same subsampling probability. Furthermore, the post-convergence accuracy improvement of our method is larger on the MNIST dataset than on the CIFAR10 dataset, and the model converges under a smaller privacy budget on the CIFAR10 dataset.
6. Conclusions
In this paper, we presented SS-RDP-WTS, a novel scheme that aims to minimize communication costs and enhance the privacy of FL. In the FL framework with a MAC, the model gradient was compressed using multi-level stochastic gradient quantization to address the constraint of limited communication resources; together with subsampling, this compression minimizes communication costs and enhances the algorithm's communication efficiency. Additionally, the system introduces discrete Gaussian noise and leverages the privacy amplification effect of subsampling to strengthen privacy protection, exploiting the tight composition properties of RDP. The theoretical analysis of the subsampling private wireless transmission technique covered its convergence and privacy bounds. The empirical findings showed that the proposed scheme can enhance the efficacy of the FL algorithm while mitigating communication burdens and safeguarding data confidentiality.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was supported by the Major Research Plan of the National Natural Science Foundation of China (92167203), Key Program of Zhejiang Provincial Natural Science Foundation of China (LZ22F020007) and Zhejiang University of Science and Technology Postgraduate Research and Innovation Fund (2022yjskc24).
Conflict of interest
The authors declare that they have no conflict of interest.