
With the proliferation of data and machine learning techniques, there is a growing need to develop methods that enable collaborative training and prediction of sensitive data while preserving privacy. This paper proposes a new protocol for privacy-preserving Naive Bayes classification using secure two-party computation (STPC). The key idea is to split the training data between two non-colluding servers using STPC to train the model without leaking information. The servers secretly share their data and the intermediate computations using cryptographic techniques like Beaver's multiplication triples and Yao's garbled circuits. We implement and evaluate our protocols on the MNIST dataset, demonstrating that they achieve the same accuracy as plaintext computation with reasonable overhead. A formal security analysis in the semi-honest model shows that the scheme protects the privacy of the training data. Our work advances privacy-preserving machine learning by enabling secure outsourced Naive Bayes classification with applications such as fraud detection, medical diagnosis, and predictive analytics on confidential data from multiple entities. The modular design allows embedding different secure matrix multiplication techniques, making the framework adaptable. This line of research paves the way for practical and secure data mining in a distributed manner, upholding stringent privacy regulations.
Citation: Kun Liu, Chunming Tang. Privacy-preserving Naive Bayes classification based on secure two-party computation[J]. AIMS Mathematics, 2023, 8(12): 28517-28539. doi: 10.3934/math.20231459
Essential proteins are required for organism life; their absence results in the loss of functional modules of protein complexes and, ultimately, the death of the organism [1]. Identifying essential proteins aids in understanding cell growth control mechanisms and in discovering disease-causing genes and potential therapeutic targets, and it has crucial theoretical and practical implications for drug development and disease therapy. In biological experiments, essential proteins are mainly identified by gene knockout, gene suppression, transposon mutagenesis and similar methods, which are unfortunately time-consuming and difficult. As high-throughput data accumulate, identifying essential proteins with computational approaches becomes feasible. Such methods use the available data to find the key features that affect the importance of proteins, and then determine from those features whether a protein is important for biological function. The most common measuring techniques derive network topology features from the topological properties of the PPI network, such as Degree Centrality (DC) [2], Information Centrality (IC) [3], Closeness Centrality (CC) [4], Subgraph Centrality (SC) [5], Betweenness Centrality (BC) [6], and the sum of Edge Clustering Coefficient Centrality (NC) [7]. These methods are sensitive to the network structure, so false-positive noise and missing data easily reduce their prediction performance.
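As an illustration of the simplest of these topological measures, degree centrality scores each protein by its number of interaction partners. A minimal sketch, assuming (hypothetically) that the PPI network is stored as a dict mapping each protein to the set of its neighbors; this layout is not specified by the paper:

```python
def degree_centrality(adj):
    """Degree Centrality (DC): score each protein by how many
    interaction partners it has in the PPI network."""
    return {protein: len(neighbors) for protein, neighbors in adj.items()}

# Rank proteins by DC, highest first
adj = {"p1": {"p2", "p3"}, "p2": {"p1"}, "p3": {"p1"}}
ranking = sorted(adj, key=lambda p: degree_centrality(adj)[p], reverse=True)
```

The other centrality measures follow the same pattern but score nodes by paths, subgraphs, or clustering rather than raw degree, which is exactly why they inherit the same sensitivity to false edges.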
In addition to network topological characteristics, the biological characteristics used in essential protein identification mainly include sequence features and functional features. Zhang et al. and Li et al. combined gene expression profile features with topological features of the PPI network, proposing the CoEWC [8] and PeC [9] methods respectively. Zhao et al. [10] put forward an essential protein detection model named POEM that exploits the modular features of essential proteins: a high-confidence weighted network is built from the topological structure and intrinsic characteristics of the network together with gene expression information; functional modules with weak coupling and strong cohesion are discovered; and finally each protein is scored by the weighted density of the module to which it belongs. Zhang et al. [11] proposed a model named FDP that employs the global and local topological properties of the network together with protein homology information, combining dynamic PPI networks at different time points. In 2021, Zhong et al. [12] introduced a novel measuring approach named JDC that binarizes gene expression data with a dynamic threshold and combines the Jaccard similarity index with degree centrality.
Methods based on multi-source data integration effectively improve prediction accuracy and robustness. The common processing approach is to build a highly reliable weighted PPI network through weighted summation, with different features used by different prediction methods. However, simple superposition obscures the complicated relationships among the multi-source data and introduces artificial noise. Parameter settings also matter, influencing the practical application of such algorithms. Non-negative matrix tri-factorization (NMTF) [13] is mainly used to analyze data matrices with non-negative elements: it decomposes the input matrix into three non-negative factor matrices and approximates the input through a low-rank non-negative representation. It has been widely used in fields such as text mining [14], recommender systems [15,16] and biological data analysis [17,18].
In view of the advantages of NMTF in data analysis, and integrating protein homology information and subcellular localization information to improve prediction performance, we propose IEPMSF, an essential protein identification approach based on symmetric non-negative matrix tri-factorization, to address the noise problem in identifying essential proteins. To avoid the additional noise caused by multi-source data integration, this paper uses only the topological features of the original protein interaction data to construct the protein weighted network. This network is not optimal, however, because of false negatives and false positives. To solve this problem, the traditional NMTF algorithm is optimized: the factorization process is treated as a "soft clustering" of proteins, and potential protein-protein interactions are predicted by a non-negative matrix symmetric tri-factorization algorithm (NMSTF), forming the optimized protein weighted network. Finally, to predict essential proteins, the homology information and subcellular localization information of proteins are combined into an initial score for each protein, which is then used to score and rank every protein in the optimized network with a random walk with restart algorithm.
This paper builds an improved protein weighted network using the protein-protein interaction network and the NMSTF algorithm to increase the accuracy of essential protein identification, and integrates subcellular localization information with protein homology information to design a model for identifying essential proteins, IEPMSF. The model consists of three modules: a weighted network construction module, a weighted network optimization module, and a protein scoring and ranking module.
Through topological analysis of yeast networks, researchers found that PPI networks have small-world and scale-free characteristics [19] and that protein essentiality is strongly connected with the topological properties of proteins. The co-neighbor coefficient is commonly used in functional recognition [20] of proteins in PPI networks; it reflects the observation that the more shared neighbors two proteins in a network have, the more likely they are to interact. To measure the degree of interaction between two proteins, we use the co-neighbor coefficient to assign edge weights in the protein interaction network.
A PPI network can be modeled as a simple undirected graph G = (V, E), where the node set V = {v1, v2, …} represents proteins and the edge set E = {e1, e2, e3, …} represents interactions between pairs of distinct proteins. A weighted network is defined as WG = (V, E, P), where P(i, j), indicating the likelihood of interaction between proteins vi and vj, is computed using the equation below:
$$P(i,j)=\begin{cases}\dfrac{|Nei(i)\cap Nei(j)|^{2}}{(|Nei(i)|-1)\times(|Nei(j)|-1)}, & \text{if } |Nei(i)|>1 \text{ and } |Nei(j)|>1\\[4pt] 0, & \text{otherwise}\end{cases} \qquad (1)$$
where Nei(i) and Nei(j) respectively denote the sets of neighbor nodes of vi and vj, and |Nei(i) ∩ Nei(j)| is the number of common neighbors. If vi and vj have no common neighbor proteins, then P(i, j) = 0. We assume the interaction probabilities, i.e., the co-neighbor coefficients between protein pairs, are independent of one another; each lies in the range [0, 1].
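Eq (1) translates directly into code. A minimal sketch, assuming (hypothetically) that the network is stored as a dict mapping each protein to the set of its neighbors:

```python
def co_neighbor_weights(adj):
    """Edge weights P(i, j) of Eq (1): the squared number of common
    neighbors, normalized by the product of (degree - 1) terms.

    adj: dict mapping each protein to the set of its interaction partners.
    Returns {(i, j): P(i, j)} for every edge, each edge scored once.
    """
    weights = {}
    for i in adj:
        for j in adj[i]:
            if i < j:  # undirected network: visit each edge once
                ni, nj = adj[i], adj[j]
                if len(ni) > 1 and len(nj) > 1:
                    common = len(ni & nj)
                    weights[(i, j)] = common ** 2 / ((len(ni) - 1) * (len(nj) - 1))
                else:
                    weights[(i, j)] = 0.0
    return weights
```

In a triangle of three mutually interacting proteins, every edge receives weight 1: each endpoint pair shares exactly one common neighbor and each endpoint has degree 2.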
As previously stated, PPI networks derived from high-throughput biological experiments contain false positives and false negatives. In other words, some uncertainty remains in a weighted network constructed from protein interactions. NMTF, proposed by Ding et al. in 2006 [13], is an effective tool that has been applied successfully in recommender systems. We can therefore exploit potential new protein interactions from the existing protein interaction data using NMTF technology.
The traditional NMTF decomposes the correlation matrix $Y_{n\times n}$ into three low-rank sub-matrices, $F\in\mathbb{R}^{n\times k}$, $S\in\mathbb{R}^{k\times k}$ and $G\in\mathbb{R}^{n\times k}$, whose product approximates the original input matrix:
$$P \approx Y = FSG^{T} \qquad (2)$$
where the parameter k is the factorization rank, reflecting the number of basis vectors of the column and row spaces. After the protein interaction network is weighted with co-neighbor coefficients, the association matrix of the network can be constructed to represent the connection relationships between proteins; its elements are the co-neighbor values of the corresponding edges. Because of the singularity of nodes in the protein interaction network, and because the resulting correlation matrix is symmetric, directly applying conventional NMTF is not well justified. Hart [21] pointed out that essential proteins tend to cluster together, and that protein essentiality is related to protein complexes rather than to a single protein, indicating that essential proteins have modular properties. Specifically, given a non-negative input matrix P, the factor matrices can be viewed as cluster indicators [14] of the vertices. Based on this, this paper proposes an improved NMTF algorithm, non-negative matrix symmetric tri-factorization, rewriting Eq (2) in the following form:
$$P \approx Y = USU^{T} \qquad (3)$$
Here, $U\in\mathbb{R}^{n\times k}$ can be seen as the "soft" clustering labels of the proteins, and $S\in\mathbb{R}^{k\times k}$ as a correlation matrix between protein modules, with $S = S^{T}$. The loss objective function for Eq (3) is then designed as follows:
$$D=\min_{U\ge 0,\,S\ge 0} J(U,S)=\|P-USU^{T}\|_{F}^{2} \qquad (4)$$
where $\|\cdot\|_{F}$ denotes the Frobenius norm. Because the objective is jointly nonconvex in U and S, we derive multiplicative update rules using an auxiliary function. Using the identity $\|X\|_{F}^{2}=\mathrm{Tr}(X^{T}X)$ for the squared Frobenius norm, D expands as:

$$D=\mathrm{Tr}\left(P^{T}P-2P^{T}USU^{T}+US^{T}U^{T}USU^{T}\right) \qquad (5)$$
Taking the partial derivatives of Eq (5) with respect to the factors U and S gives:

$$\frac{\partial D}{\partial U}=-4PUS+4USU^{T}US$$

$$\frac{\partial D}{\partial S}=-2U^{T}PU+2U^{T}USU^{T}U \qquad (6)$$
Following the Karush-Kuhn-Tucker (KKT) complementary slackness conditions, a stationary point must satisfy, for U (and analogously for S):

$$\frac{\partial D}{\partial U_{ik}}U_{ik}=0 \qquad (7)$$
By Eq (7), we get:

$$\left(USU^{T}US-PUS\right)_{ik}U_{ik}=0$$

$$U_{ik}=U_{ik}\frac{(PUS)_{ik}}{(USU^{T}US)_{ik}} \qquad (8)$$
Similarly, S can be derived by the same procedure:

$$S_{ik}=S_{ik}\frac{(U^{T}PU)_{ik}}{(U^{T}USU^{T}U)_{ik}} \qquad (9)$$
These rules can be expressed as element-wise multiplicative updates:

$$U_{ik}\leftarrow U_{ik}\frac{(PUS)_{ik}}{(USU^{T}US)_{ik}}$$

$$S_{ik}\leftarrow S_{ik}\frac{(U^{T}PU)_{ik}}{(U^{T}USU^{T}U)_{ik}} \qquad (10)$$
Iterating the multiplicative update rules above yields the final U and S, and thus the optimal $Y=USU^{T}$ approximating the original input matrix.
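The updates of Eq (10) translate directly into NumPy. A minimal sketch, in which the random initialization, the fixed iteration count, and the small epsilon guarding against division by zero are all assumptions not specified in the paper:

```python
import numpy as np

def nmstf(P, k, iters=200, eps=1e-9, seed=0):
    """Symmetric non-negative matrix tri-factorization P ~ U S U^T
    via the multiplicative updates of Eq (10)."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    U = rng.random((n, k))
    S = rng.random((k, k))
    S = (S + S.T) / 2                      # keep S symmetric, S = S^T
    for _ in range(iters):
        US = U @ S
        # U update, Eq (10): U <- U * (P U S) / (U S U^T U S)
        U *= (P @ US) / (US @ U.T @ US + eps)
        # S update, Eq (10): S <- S * (U^T P U) / (U^T U S U^T U)
        UtU = U.T @ U
        S *= (U.T @ P @ U) / (UtU @ S @ UtU + eps)
    return U, S
```

Each update multiplies the current factor by a ratio that equals one at a stationary point, so non-negativity is preserved automatically and the fit improves monotonically in practice.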
After the above processing, we obtain an optimized network association matrix, which is normalized as follows:

$$P^{*}(i,j)=\begin{cases}\dfrac{\max(Y_{ij},Y_{ji})}{\sum_{k=0}^{N}Y_{ik}}, & \sum_{k=0}^{N}Y_{ik}\neq 0\\[4pt] 0, & \text{otherwise}\end{cases} \qquad (11)$$

After this normalization, each row i of the matrix P* sums to either 0 or 1.
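Eq (11) can be sketched with NumPy as follows. The helper is hypothetical; note that since Y = USUᵀ is symmetric, the max(Yij, Yji) term reduces to Yij, but it is kept for generality:

```python
import numpy as np

def normalize_association(Y):
    """Row normalization of Eq (11): each entry is max(Yij, Yji)
    divided by the i-th row sum of Y; rows summing to zero map to zero."""
    M = np.maximum(Y, Y.T)                      # max(Y_ij, Y_ji)
    row_sums = Y.sum(axis=1, keepdims=True)
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums != 0)
```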
To improve the accuracy of essential protein prediction, we assign each protein in the interaction network an initial score derived from direct homology information and subcellular localization information.
Studies have shown that when a protein has more homologous proteins in a reference species, it is highly likely to be an essential protein. The direct homology score of protein node vi is calculated by the equation below:
$$HS(i)=\frac{HP(i)}{\max_{1\le j\le |V|}HP(j)} \qquad (12)$$
where HP(i) is the number of reference species in the collection SC in which node vi has a direct homologous protein, computed as:

$$HP(i)=\sum_{m\in SC}TN_{i}^{m},\quad \text{where } TN_{i}^{m}=\begin{cases}1, & \text{if } v_{i}\in XS_{m}\\0, & \text{otherwise}\end{cases} \qquad (13)$$

where $XS_{m}$ is the subset of V consisting of proteins with a direct homolog in reference species m. Proteins that possess homologous proteins in all reference species receive a direct homology score of 1; conversely, a protein with no direct homolog in any reference species scores 0.
Previous research has revealed that the essentiality of proteins is linked not only to the biological properties of PPI networks but also to their spatial location, so making full use of subcellular localization information is important for essential protein prediction. Studies have shown that essential proteins occur in higher concentrations at certain subcellular locations than non-essential ones, and are more evolutionarily conserved [22]. Let L(r) be the set of proteins appearing at subcellular location r; the occurrence frequency of each subcellular location r can then be calculated as follows:
$$OF(r)=\frac{|L(r)|}{\max_{k\in R}|L(k)|} \qquad (14)$$

where |L(r)| is the number of proteins present at subcellular location r, and R is the set of all subcellular locations. For a protein vi, let C(i) be the set of subcellular locations in which it occurs; its subcellular localization score LS(i) is defined as the maximum occurrence frequency over those locations:

$$LS(i)=\max_{r\in C(i)}OF(r) \qquad (15)$$
Combining the direct homology score of Eq (12) with the subcellular localization score of Eq (15), the initial score IS(i) of each protein vi in the protein interaction network is computed as:

$$IS(i)=HS(i)\times LS(i) \qquad (16)$$
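Eqs (12)-(16) can be sketched together. The dict-based data layout below is an assumption for illustration; in the paper the homology data come from InParanoid and the location data from COMPARTMENTS:

```python
def initial_scores(homologs, locations, proteins):
    """Initial score IS(i) = HS(i) * LS(i), Eqs (12)-(16).

    homologs:  dict protein -> set of reference species with a direct homolog
    locations: dict protein -> set of subcellular locations where it appears
    proteins:  iterable of all proteins in the network
    """
    hp = {p: len(homologs.get(p, ())) for p in proteins}          # Eq (13)
    max_hp = max(hp.values()) or 1
    # occurrence count of each subcellular location, for OF(r) in Eq (14)
    counts = {}
    for p in proteins:
        for r in locations.get(p, ()):
            counts[r] = counts.get(r, 0) + 1
    max_loc = max(counts.values(), default=1)
    scores = {}
    for p in proteins:
        hs = hp[p] / max_hp                                       # Eq (12)
        ls = max((counts[r] / max_loc for r in locations.get(p, ())),
                 default=0.0)                                     # Eq (15)
        scores[p] = hs * ls                                       # Eq (16)
    return scores
```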
Based on the weighted network constructed previously and the initial scores derived from multi-source biological information, the final score FS(i) of a protein vi in the network can be calculated as below:
$$FS(i)=\alpha\sum_{j\in Nei(i)}P^{*}(i,j)FS(j)+(1-\alpha)IS(i) \qquad (17)$$
where Nei(i) denotes the set of neighbor nodes of vi.
As can be seen from Eq (17), a protein's final score is a linear combination of its multi-source bioinformatics score and the scores of its neighbors, with the proportion of the two adjusted by the parameter α. When α = 0, the final score depends only on the multi-source biological information; when α = 1, it depends only on the common-neighbor properties of the protein. Because the network contains many proteins, solving Eq (17) protein by protein is computationally expensive, so we rewrite it in matrix-vector form:

$$FS=\alpha P^{*}FS+(1-\alpha)IS \qquad (18)$$
Finally, Eq (18) can be solved numerically with the Jacobi iterative method:

$$FS^{t}=\alpha P^{*}FS^{t-1}+(1-\alpha)IS \qquad (19)$$

where t = 0, 1, 2, … indexes the iterations.
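The iteration of Eq (19) is a few lines of NumPy. A sketch, where the convergence tolerance and the iteration cap are assumptions (the paper does not state its stopping rule), and α = 0.2 follows the value the paper chooses experimentally:

```python
import numpy as np

def random_walk_with_restart(P_star, IS, alpha=0.2, tol=1e-8, max_iter=1000):
    """Jacobi iteration of Eq (19): FS_t = alpha * P* @ FS_{t-1} + (1-alpha) * IS.

    P_star: row-normalized weighted adjacency matrix, Eq (11)
    IS:     vector of initial scores, Eq (16)
    """
    FS = IS.copy()
    for _ in range(max_iter):
        FS_next = alpha * (P_star @ FS) + (1 - alpha) * IS
        if np.linalg.norm(FS_next - FS, 1) < tol:   # converged
            return FS_next
        FS = FS_next
    return FS
```

Because α < 1 and the rows of P* sum to at most 1, the iteration is a contraction and converges to the unique fixed point of Eq (18) regardless of the starting vector.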
The validity of the IEPMSF model was evaluated on benchmark essential protein data. The data incorporate an essential protein dataset, a PPI network dataset, a protein homology dataset, and a subcellular location dataset. The benchmark set contains 1199 essential proteins, drawn mainly from the MIPS [23], SGD [24], DEG [25] and SGDP [26] databases. The PPI network data come from the DIP [27] database; after excluding repeated interactions and self-interactions, the collection contains 5093 proteins and 24,743 interactions. The subcellular location data were downloaded from the COMPARTMENTS [28] database, which integrates the MGD [29], SGD [24], UniProtKB [30], WormBase [31] and FlyBase [32] databases, eventually yielding 3923 proteins with subcellular location information. The homologous protein data were gathered from the 7th edition of the InParanoid database [33], which includes pairwise comparisons of the whole genomes of 99 eukaryotes and 1 prokaryote.
To determine the significance of proteins in the protein interaction network, the results of IEPMSF are compared with those of existing methods: DC [2], IC [3], CC [4], BC [5], SC [6], NC [7], PeC [9], CoEWC [8], POEM [10], FDP [11] and JDC [12].
In IEPMSF, the ranking scores of the proteins depend on the parameter α. To study the impact of α on the performance of IEPMSF, we experimented with several values ranging from 0 to 1 and observed their effect on the accuracy of essential protein prediction; Table 1 contains the detailed experimental data. The candidate sets range from the top 100 to the top 600 ranked proteins, and predictive accuracy is the proportion of candidates that are actually essential proteins.
| α | Top 100 | Top 200 | Top 300 | Top 400 | Top 500 | Top 600 |
|-----|--------|--------|--------|--------|--------|--------|
| 0   | 78.00% | 77.00% | 73.70% | 72.30% | 67.00% | 63.00% |
| 0.1 | 97.00% | 84.50% | 78.00% | 74.00% | 68.40% | 64.50% |
| 0.2 | 92.00% | 85.50% | 79.70% | 74.50% | 69.80% | 65.30% |
| 0.3 | 89.00% | 86.00% | 78.30% | 72.80% | 69.20% | 64.80% |
| 0.4 | 87.00% | 83.00% | 76.00% | 71.80% | 68.60% | 65.00% |
| 0.5 | 87.00% | 78.00% | 74.00% | 70.00% | 67.20% | 64.30% |
| 0.6 | 86.00% | 77.00% | 71.30% | 69.00% | 64.80% | 63.00% |
| 0.7 | 85.00% | 75.00% | 69.00% | 66.80% | 63.80% | 60.00% |
| 0.8 | 82.00% | 74.00% | 67.30% | 64.50% | 62.00% | 59.20% |
| 0.9 | 83.00% | 75.00% | 65.30% | 62.80% | 59.60% | 57.30% |
| 1   | 81.00% | 71.00% | 64.70% | 59.80% | 55.80% | 53.20% |
As shown in Table 1, when α = 0 the prediction considers only the direct homology of a protein, while when α = 1 it considers only the co-neighbor information. At α = 0 or α = 1, IEPMSF performs worse than at intermediate values, which means that combining the direct homology of proteins with their neighborhood information predicts essential proteins more accurately than either property alone. With α = 0.1, when the top 100 ranked proteins are chosen as essential protein candidates, the accuracy reaches 97%, as shown in the experimental findings in Figure 2.

When higher-scoring essential protein candidates are selected at different cutoffs (top 100, 200, 300, 400, 500, and 600), the highest accuracies are 97% (α = 0.1), 86% (α = 0.3), 79.7% (α = 0.2), 74.5% (α = 0.2), 69.8% (α = 0.2) and 65.3% (α = 0.2) respectively. As the number of candidate proteins grows, the maximum accuracy concentrates at α = 0.2. We therefore set α = 0.2 in the following experiments.
The PR curve is used to further validate the performance of the various approaches. First, proteins in the protein interaction network are sorted in descending order of the final score computed by each technique. The top K proteins are considered essential (positive set) and the remaining proteins non-essential (negative set), with the threshold K ranging from 1 to 5093. As K varies, the corresponding precision and recall values are computed for each approach, producing the PR curves illustrated in Figure 3. Figure 3(a) compares the PR curve of IEPMSF with those of the centrality algorithms (DC, IC, CC, BC, SC, and NC), and Figure 3(b) with those of the multi-source information fusion methods (PeC, CoEWC, POEM, JDC, and FDP). As seen in Figure 3, the PR curve of IEPMSF lies well above those of the other algorithms.
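The PR curve construction just described can be sketched as follows, assuming a hypothetical list of proteins already sorted by descending final score and the benchmark set of essential proteins:

```python
def pr_curve(ranked, essential):
    """Precision-recall points as the threshold K sweeps the ranking.

    ranked:    proteins sorted by descending final score
    essential: set of benchmark essential proteins
    Returns [(precision, recall)] for K = 1 .. len(ranked).
    """
    points, tp = [], 0
    total = len(essential)
    for k, protein in enumerate(ranked, start=1):
        if protein in essential:
            tp += 1                          # true positive among top K
        points.append((tp / k, tp / total))  # (precision, recall) at K = k
    return points
```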
To further examine the prediction performance of IEPMSF and the other approaches, we apply the jackknife methodology; Figure 4 depicts the experimental outcomes. The X-axis represents the number of top-ranked candidate essential proteins produced by each approach, the Y-axis the number of true essential proteins among them, and the methods are compared by the area below their curves. Figure 4(a) shows the comparison between DC, IC, CC, BC, SC, NC and IEPMSF; from it we see that IEPMSF predicts essential proteins significantly more accurately than NC. Figure 4(b) compares IEPMSF with the existing multi-source information fusion methods (PeC, CoEWC, POEM, JDC and FDP). According to all of the experimental data, IEPMSF predicts essential proteins more accurately than the other 11 approaches.
Identifying essential proteins is not only a prerequisite for understanding how organisms survive, but is also critical for discovering disease-causing genes and potential therapeutic targets. This paper designed an essential protein identification model, IEPMSF. To avoid the extra noise caused by multi-source data integration, the model builds the weighted network using only the common-neighbor topological properties of the nodes in the original PPI data. Considering the false positives and false negatives in PPI data produced by high-throughput experiments, and the clustering capability of NMTF, the weighted network was then optimized with the non-negative matrix symmetric tri-factorization (NMSTF) technique to uncover probable protein-protein interactions. Finally, the starting score of each protein node was calculated from subcellular location and homologous protein information, and the random walk with restart method was used to score and rank each protein in the network. Compared with topological centrality methods and traditional multi-source information integration methods, the experimental findings show that the proposed approach, IEPMSF, significantly improves essential protein prediction performance. Building on this work, designing a more effective way to construct a weighted network from multi-source information integration is a future research direction for essential protein identification. In the long term, we will investigate including more biological data during the weighted network construction step and try to apply the model to other species.
This project is partially funded by the National Natural Science Foundation of China (61772089, 62006030), Natural Science Foundation of Hunan Province (2020JJ4648), Major Scientific and Technological Projects for collaborative prevention and control of birth defects in Hunan Province (2019SK1010).
The authors declare no competing interests.
[1] | M. Kantarcıoglu, J. Vaidya, C. Clifton, Privacy preserving Naive Bayes classifier for horizontally partitioned data, IEEE ICDM Workshop on Privacy Preserving Data Mining, 2003, 3–9. |
[2] | J. Vaidya, C. W. Clifton, Y. M. Zhu, Privacy-preserving data mining, Vol. 19, New York: Springer, 2006. https://doi.org/10.1007/978-0-387-29489-6 |
[3] | P. Mohassel, Y. Zhang, SecureML: a system for scalable privacy-preserving machine learning, 2017 IEEE symposium on security and privacy (SP), San Jose, CA, USA, 2017, 19–38. https://doi.org/10.1109/SP.2017.12 |
[4] | M. S. Riazi, C. Weinert, O. Tkachenko, E. M. Songhori, T. Schneider, F. Koushanfar, Chameleon: a hybrid secure computation framework for machine learning applications, ASIACCS '18: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, 2018,707–721. https://doi.org/10.1145/3196494.3196522 |
[5] | C. Juvekar, V. Vaikuntanathan, A. Chandrakasan, GAZELLE: a low latency framework for secure neural network inference, SEC'18: Proceedings of the 27th USENIX Conference on Security Symposium, 2018, 1651–1669. |
[6] | M. S. Riazi, M. Samragh, H. Chen, K. Laine, K. Lauter, F. Koushanfar, XONN: XNOR-based oblivious deep neural network inference, SEC'19: Proceedings of the 28th USENIX Conference on Security Symposium, 2019, 1501–1518. |
[7] |
R. Agrawal, R. Srikant, Privacy-preserving data mining, ACM SIGMOD Record, 2000,439–450. https://doi.org/10.1145/335191.335438 doi: 10.1145/335191.335438
![]() |
[8] | S. De Hoogh, B. Schoenmakers, P. Chen, H. op den Akker, Practical secure decision tree learning in a teletreatment application, In: N. Christin, R. Safavi-Naini, Financial cryptography and data security, FC 2014, Berlin, Heidelberg: Springer, 8437 (2014), 179–194. https://doi.org/10.1007/978-3-662-45472-5_12 |
[9] | C. Choudhary, M. De Cock, R. Dowsley, A. Nascimento, D. Railsback, Secure training of extra trees classifiers over continuous data, AAAI-20 Workshop on Privacy-Preserving Artificial Intelligence, 2020. |
[10] |
M. Abspoel, D. Escudero, N. Volgushev, Secure training of decision trees with continuous attributes, Proc. Priv. Enhancing Technol., 2021 (2021), 167–187. https://doi.org/10.2478/popets-2021-0010 doi: 10.2478/popets-2021-0010
![]() |
[11] | V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, N. Taft, Privacy-preserving ridge regression on hundreds of millions of records, 2013 IEEE Symposium on Security and Privacy, 2013,334–348. https://doi.org/10.1109/SP.2013.30 |
[12] | M. de Cock, R. Dowsley, A. C. A. Nascimento, S. C. Newman, Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data, AISec '15: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, 2015, 3–14. https://doi.org/10.1145/2808769.2808774 |
[13] |
A. Agarwal, R. Dowsley, N. D. McKinney, D. Wu, C. T. Lin, M. De Cock, et al., Protecting privacy of users in brain-computer interface applications, IEEE Transactions on Neural Systems and Rehabilitation Engineering, 27 (2019), 1546–1555. https://doi.org/10.1109/TNSRE.2019.2926965 doi: 10.1109/TNSRE.2019.2926965
![]() |
[14] |
H. Chen, R. Gilad-Bachrach, K. Han, Z. Huang, A. Jalali, K. Laine, et al., Logistic regression over encrypted data from fully homomorphic encryption, BMC Med. Genomics, 11 (2018), 81. https://doi.org/10.1186/s12920-018-0397-z doi: 10.1186/s12920-018-0397-z
![]() |
[15] | S. Truex, L. Liu, M. E. Gursoy, L. Yu, Privacy-preserving inductive learning with decision trees, 2017 IEEE International Congress on Big Data (BigData Congress), 2017, 57–64. https://doi.org/10.1109/BigDataCongress.2017.17 |
[16] |
M. E. Skarkala, M. Maragoudakis, S. Gritzalis, L. Mitrou, PPDM-TAN: a privacy-preserving multi-party classifier, Computation, 9 (2021), 6. https://doi.org/10.3390/computation9010006 doi: 10.3390/computation9010006
![]() |
[17] | N. Agrawal, A. S. Shamsabadi, M. J. Kusner, A. Gascón, QUOTIENT: two-party secure neural network training and prediction, CCS '19: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 2019, 1231–1247. https://doi.org/10.1145/3319535.3339819 |
[18] |
S. Wagh, D. Gupta, N. Chandran, SecureNN: 3-party secure computation for neural network training, Proc. Priv. Enhancing Technol., 2019 (2019), 26–49. https://doi.org/10.2478/popets-2019-0035 doi: 10.2478/popets-2019-0035
![]() |
[19] | C. Guo, A. Hannun, B. Knott, L. van der Maaten, M. Tygert, R. Zhu, Secure multiparty computations in floating-point arithmetic, arXiv, 2020. https://doi.org/10.48550/arXiv.2001.03192 |
[20] | M. De Cock, R. Dowsley, A. C. A. Nascimento, D. Railsback, J. Shen, A. Todoki, High performance logistic regression for privacy-preserving genome analysis, BMC Med. Genomics, 14 (2021), 23. https://doi.org/10.1186/s12920-020-00869-9 |
[21] | Y. Fan, J. Bai, X. Lei, W. Lin, Q. Hu, G. Wu, et al., PPMCK: privacy-preserving multi-party computing for k-means clustering, J. Parallel Distr. Com., 154 (2021), 54–63. https://doi.org/10.1016/j.jpdc.2021.03.009 |
[22] | Y. Lindell, B. Pinkas, Privacy preserving data mining, In: M. Bellare, Advances in cryptology–CRYPTO 2000, Lecture Notes in Computer Science, Berlin, Heidelberg: Springer, 1880 (2000), 36–54. https://doi.org/10.1007/3-540-44598-6_3 |
[23] | E. Yilmaz, M. Al-Rubaie, J. M. Chang, Naive Bayes classification under local differential privacy, 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), 2020, 709–718. https://doi.org/10.1109/DSAA49011.2020.00081 |
[24] | H. Kargupta, S. Datta, Q. Wang, K. Sivakumar, On the privacy preserving properties of random data perturbation techniques, Third IEEE International Conference on Data Mining, 2003, 99–106. https://doi.org/10.1109/ICDM.2003.1250908 |
[25] | R. Bost, R. A. Popa, S. Tu, S. Goldwasser, Machine learning classification over encrypted data, NDSS, 2015. https://doi.org/10.14722/ndss.2015.23241 |
[26] | A. Wood, V. Shpilrain, K. Najarian, A. Mostashari, D. Kahrobaei, Private-key fully homomorphic encryption for private classification, In: J. Davenport, M. Kauers, G. Labahn, J. Urban, Mathematical Software–ICMS 2018, Cham: Springer, 10931 (2018), 475–481. https://doi.org/10.1007/978-3-319-96418-8_56 |
[27] | S. C. Rambaud, J. Hernandez-Perez, A naive justification of hyperbolic discounting from mental algebraic operations and functional analysis, Quant. Financ. Econ., 7 (2023), 463–474. https://doi.org/10.3934/QFE.2023023 |
[28] | G. A. Tsiatsios, J. Leventides, E. Melas, C. Poulios, A bounded rational agent-based model of consumer choice, Data Sci. Financ. Econ., 3 (2023), 305–323. https://doi.org/10.3934/DSFE.2023018 |
[29] | Z. Li, Z. Huang, Y. Su, New media environment, environmental regulation and corporate green technology innovation: evidence from China, Energy Econ., 119 (2023), 106545. https://doi.org/10.1016/j.eneco.2023.106545 |
[30] | X. Sun, P. Zhang, J. K. Liu, J. Yu, W. Xie, Private machine learning classification based on fully homomorphic encryption, IEEE Transactions on Emerging Topics in Computing, 8 (2018), 352–364. https://doi.org/10.1109/TETC.2018.2794611 |
[31] | A. Kjamilji, E. Savaş, A. Levi, Efficient secure building blocks with application to privacy preserving machine learning algorithms, IEEE Access, 9 (2021), 8324–8353. https://doi.org/10.1109/ACCESS.2021.3049216 |
[32] | A. Khedr, G. Gulak, V. Vaikuntanathan, SHIELD: scalable homomorphic implementation of encrypted data-classifiers, IEEE Transactions on Computers, 65 (2015), 2848–2858. https://doi.org/10.1109/TC.2015.2500576 |
[33] | N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, J. Wernsing, CryptoNets: applying neural networks to encrypted data with high throughput and accuracy, ICML'16: Proceedings of the 33rd International Conference on International Conference on Machine Learning, 48 (2016), 201–210. |
[34] | S. Kim, M. Omori, T. Hayashi, T. Omori, L. Wang, S. Ozawa, Privacy-preserving naive Bayes classification using fully homomorphic encryption, In: L. Cheng, A. Leung, S. Ozawa, Neural Information Processing, ICONIP 2018, Cham: Springer, 11304 (2018), 349–358. https://doi.org/10.1007/978-3-030-04212-7_30 |
[35] | D. H. Vu, Privacy-preserving Naive Bayes classification in semi-fully distributed data model, Comput. Secur., 115 (2022), 102630. https://doi.org/10.1016/j.cose.2022.102630 |
[36] | D. H. Vu, T. S. Vu, T. D. Luong, An efficient and practical approach for privacy-preserving Naive Bayes classification, J. Inf. Secur. Appl., 68 (2022), 103215. https://doi.org/10.1016/j.jisa.2022.103215 |
[37] | P. Li, J. Li, Z. Huang, C. Z. Gao, W. B. Chen, K. Chen, Privacy-preserving outsourced classification in cloud computing, Cluster Comput., 21 (2018), 277–286. https://doi.org/10.1007/s10586-017-0849-9 |
[38] | C. Gentry, Fully homomorphic encryption using ideal lattices, STOC '09: Proceedings of the forty-first annual ACM symposium on Theory of computing, 2009, 169–178. https://doi.org/10.1145/1536414.1536440 |
[39] | X. Yi, Y. Zhang, Privacy-preserving Naive Bayes classification on distributed data via semi-trusted mixers, Inf. Syst., 34 (2009), 371–380. https://doi.org/10.1016/j.is.2008.11.001 |
[40] | A. C. Yao, Protocols for secure computations, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982), 1982,160–164. https://doi.org/10.1109/SFCS.1982.38 |
[41] | T. Elgamal, A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Transactions on Information Theory, 31 (1985), 469–472. https://doi.org/10.1109/TIT.1985.1057074 |
[42] | P. Paillier, Public-key cryptosystems based on composite degree residuosity classes, In: J. Stern, Advances in cryptology–EUROCRYPT '99, Lecture Notes in Computer Science, Berlin, Heidelberg: Springer, 1592 (1999), 223–238. https://doi.org/10.1007/3-540-48910-X_16 |
[43] | S. Goldwasser, S. Micali, Probabilistic encryption, J. Comput. Syst. Sci., 28 (1984), 270–299. https://doi.org/10.1016/0022-0000(84)90070-9 |
[44] | W. Henecka, S. Kögl, A. R. Sadeghi, T. Schneider, I. Wehrenberg, TASTY: tool for automating secure two-party computations, CCS '10: Proceedings of the 17th ACM conference on Computer and communications security, 2010, 451–462. https://doi.org/10.1145/1866307.1866358 |
[45] | A. Ben-David, N. Nisan, B. Pinkas, FairplayMP: a system for secure multi-party computation, CCS '08: Proceedings of the 15th ACM conference on Computer and communications security, 2008, 257–266. https://doi.org/10.1145/1455770.1455804 |
[46] | X. Liu, R. H. Deng, K. K. R. Choo, Y. Yang, Privacy-preserving outsourced support vector machine design for secure drug discovery, IEEE Transactions on Cloud Computing, 8 (2020), 610–622. https://doi.org/10.1109/TCC.2018.2799219 |
[47] | X. Yi, Y. Zhang, Privacy-preserving Naive Bayes classification on distributed data via semi-trusted mixers, Inf. Syst., 34 (2009), 371–380. https://doi.org/10.1016/j.is.2008.11.001 |
[48] | H. Park, P. Kim, H. Kim, K. W. Park, Y. Lee, Efficient machine learning over encrypted data with non-interactive communication, Comput. Stand. Inter., 58 (2018), 87–108. https://doi.org/10.1016/j.csi.2017.12.004 |
[49] | X. Liu, R. H. Deng, K. K. R. Choo, Y. Yang, Privacy-preserving outsourced clinical decision support system in the cloud, IEEE Transactions on Services Computing, 14 (2017), 222–234. https://doi.org/10.1109/TSC.2017.2773604 |
[50] | R. Podschwadt, D. Takabi, P. Hu, M. H. Rafiei, Z. Cai, A survey of deep learning architectures for privacy-preserving machine learning with fully homomorphic encryption, IEEE Access, 10 (2022), 117477–117500. https://doi.org/10.1109/ACCESS.2022.3219049 |
[51] | D. Beaver, One-time tables for two-party computation, In: W. L. Hsu, M. Y. Kao, Computing and combinatorics, COCOON 1998, Berlin, Heidelberg: Springer, 1449 (1998), 361–370. https://doi.org/10.1007/3-540-68535-9_40 |
[52] | M. De Cock, R. Dowsley, C. Horst, R. Katti, A. C. A. Nascimento, W. S. Poon, et al., Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation, IEEE Transactions on Dependable and Secure Computing, 16 (2017), 217–230. https://doi.org/10.1109/TDSC.2017.2679189 |
[53] | D. Reich, A. Todoki, R. Dowsley, M. De Cock, A. Nascimento, Privacy-preserving classification of personal text messages with secure multi-party computation, In: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett, Advances in neural information processing systems 32, 2019, 3752–3764. |
[54] | A. Resende, D. Railsback, R. Dowsley, A. C. A. Nascimento, D. F. Aranha, Fast privacy-preserving text classification based on secure multiparty computation, IEEE Transactions on Information Forensics and Security, 17 (2022), 428–442. https://doi.org/10.1109/TIFS.2022.3144007 |
[55] | Y. Yasumura, Y. Ishimaki, H. Yamana, Secure Naïve Bayes classification protocol over encrypted data using fully homomorphic encryption, iiWAS2019: Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, 2019, 45–54. https://doi.org/10.1145/3366030.3366056 |
[56] | R. Canetti, Universally composable security: a new paradigm for cryptographic protocols, Proceedings 42nd IEEE Symposium on Foundations of Computer Science, 2001, 136–145. https://doi.org/10.1109/SFCS.2001.959888 |
[57] | Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), 2278–2324. https://doi.org/10.1109/5.726791 |