
Multiscale modelling is a promising quantitative approach for studying infectious disease dynamics. This approach garners attention from both individuals who model diseases and those who plan for public health because it has great potential to contribute in expanding the understanding necessary for managing, reducing, and potentially exterminating infectious diseases. In this article, we developed a nested multiscale model of hepatitis B virus (HBV) that integrates the within-cell scale and the between-cell scale at cell level of organization of this disease system. The between-cell scale is linked to the within-cell scale by a once off inflow of initial viral infective inoculum dose from the between-cell scale to the within-cell scale through the process of infection; the within-cell scale is linked to the between-cell scale through the outflow of the virus from the within-cell scale to the between-cell scale through the process of viral shedding or excretion. The resulting multiple scales model is bidirectionally coupled in such a way that the within-cell scale and between-cell scale sub-models mutually affect each other, creating a reciprocal relationship. The computed reproductive number from the multiscale model confirms that the within-host scale and the between-host scale influence each other in a reciprocal manner. Numerical simulations are presented that also confirm the theoretical results and support the initial assumption that the within-cell scale and the between-cell scale influence each other in a reciprocal manner. This multiple scales modeling approach serves as a valuable tool for assessing the impact and success of health strategies aimed at controlling hepatitis B virus disease system.
Citation: Huguette Laure Wamba Makeng, Ivric Valaire Yatat-Djeumen, Bothwell Maregere, Rendani Netshikweta, Jean Jules Tewa, Winston Garira. Multiscale modelling of hepatitis B virus at cell level of organization[J]. Mathematical Biosciences and Engineering, 2024, 21(9): 7165-7193. doi: 10.3934/mbe.2024317
Various studies are being conducted on privacy protection in the field of computer communication [1,2,3]. It is known that when plaintext is transmitted without encryption, it can be eavesdropped on and intercepted. Hence, encrypted communication protocols are being used to protect privacy by encrypting transmission data. In 2017, Gartner predicted that more than 80% of web traffic would be encrypted [4]. According to the statistics reported as of August 2022, more than 80% of web pages were loaded as Hypertext Transfer Protocol Secure (HTTPS) over Transport Layer Security (TLS) protocols [5].
Encrypted communication protocols protect transmission data using encryption algorithms. Clients and servers negotiate the encryption methods and share encryption keys before transmitting the data. After the negotiation, the data are encrypted and transmitted. This method maintains the confidentiality of the data by making it impossible to know the content, even if a third party eavesdrops on or intercepts the data. However, encrypted communication protocols can be used for malicious purposes, thus posing cybersecurity issues. Cisco expected that by 2021, more than 70% of web malware would encrypt traffic, whereas 60% of organizations would fail to detect web malware traffic [6].
Attackers threaten cybersecurity by exploiting the fact that third parties are incapable of accessing the contents of transmission data when encryption communication protocols are used, as shown in Figure 1.
In general, the network security equipment is installed in the actual network path or replicates communication packets during traffic inspection. However, the data being transmitted cannot be decrypted if the traffic is encrypted. Therefore, the network security equipment cannot check the contents and take necessary steps based on analyses. Hence, attackers exploit the limited visibility of network security equipment to conduct cyberattacks such as leakage of internal information assets or sending of triggers to malicious codes to penetrate the internal network. It is, therefore, difficult to specify the attempt and scope of the attacks that use encrypted traffic, posing a significant threat.
To manage the risk of encryption communication protocol abuse, decryption must be performed to ensure the visibility of encrypted traffic data and for performing inspections. However, decrypting data for inspection risks privacy infringement while incurring additional computing resources and decryption time. In addition, the time delay between the client and server due to decryption and inspection increases, which can reduce the quality of the user experience. To overcome these disadvantages, it is necessary to classify and analyze the encrypted traffic without using a decryption process. To this end, network fingerprinting technology is one of the most promising candidates.
Traditional network fingerprinting techniques have been used to perform OS identification using header information in the TCP/IP stack or identify network topology using IP address information [7]. The procedure depends on the existing network environment configuration, on-premises server, and network configuration. However, with the advent of modern cloud computing and software-defined network technologies, network fingerprinting is expected to be less effective as the boundaries of the network become blurred and the configuration of services that are not dependent on traditional network address schemes increases. In addition, traffic-type classifiers utilizing traditional network fingerprinting operate based on the TCP/IP packet length [8]. The padding of encryption algorithms used in cryptographic communication protocols is expected to act as noise obstructing the input of traffic-type classifiers, thus reducing classifier accuracy. TLS fingerprinting is emerging as a method to compensate for these shortcomings.
TLS fingerprinting uses the handshake and header information of the TLS stack, together with encrypted application data, to identify client applications and web servers and to classify the types of content sent. Instead of relying on IP addresses, it generates fingerprints by extracting specific information (such as cipher suites and extensions) that characterizes a TLS handshake. It also extracts features from encrypted packets and builds classifiers to respond to encrypted traffic. This technology can be used alone or in conjunction with other network fingerprinting technologies in the existing TCP/IP stack to improve accuracy. Cybersecurity, therefore, requires a broad understanding of TLS fingerprinting to respond to changing networks.
This study provides a broad overview of encrypted network traffic fingerprinting techniques that analyze encrypted traffic without decryption. Fingerprinting is used to classify and identify client applications or servers. Fingerprints are generated through non-encrypted data from cryptographic negotiation (handshake) messages. They are then compared against fingerprint databases, or machine learning, artificial intelligence (AI), and statistical techniques are applied to a large number of datasets to generate identifiers and classifiers.
The contributions of this study are twofold. First, encrypted network traffic fingerprinting techniques are investigated, and several fingerprint generation and accuracy improvement methods are described. Second, the investigation results are analyzed by organizing the identifier and classifier generation techniques, which are categorized and described based on the method and features used for identification and classification. To this end, journals, conference proceedings, research papers, and internet documents are surveyed, and the findings are classified and described using the taxonomy shown in Figure 2.
The manuscript consists of six sections. Section 2 explains the background information required for understanding the fingerprinting techniques. Section 3 describes encrypted traffic fingerprint techniques and studies based on them. Section 4 discusses identifier and classifier generation techniques. The last section concludes the paper by presenting the results and future scope of the current research.
Our work provides extensive information on TLS fingerprinting gained through investigation and analysis. Before examining the detailed techniques, we relate our work to existing surveys of techniques targeting encrypted traffic [9,10,11,12]. These surveys address classification [9,10], detection [11], and analysis [12] as problem domains. Among them, Refs. [9,10,11] focus on techniques using machine learning. In addition, we refer to relevant studies [13,14,15,16] for explanations of the techniques discussed in the paper that involve privacy, graph networks, time series data, etc. We summarize the related survey papers below, with Table 1 providing a comparison of their problem domains, methods, and protocols.
Survey | Problem domains | Method | Protocols |
[9] | Classification | Focus on ML-based | Various |
[10] | Classification | ML-based | Various |
[11] | Detection | Focus on ML-based | TLS |
[12] | Analysis | Various | Various |
Present study | Fingerprinting | Various | TLS |
Velan et al. [9] discussed classification and analysis techniques for Internet Protocol Security, TLS, Secure Shell Protocol, BitTorrent, and Skype protocol traffic. They distinguished information extracted in the non-encrypted initialization phase from information extracted in the encrypted data transfer phase, and divided the classification techniques into payload- and feature-based methods. Payload-based methods include techniques that perform pattern matching on the payload (the transmitted data) or that operate on information such as the payload size, port number, and IP address, which can be obtained from packets without further processing. Feature-based methods extract features from encrypted communication patterns and utilize supervised, unsupervised, and hybrid machine learning as well as basic statistical analysis. Pacheco et al. [10] systematically described techniques that use machine learning for the classification of encrypted network traffic. They discussed the machine learning basics required for traffic classification and the representative workflow of machine-learning-based traffic classification techniques. Based on this workflow, they provided an overview, by category, of the data collection, feature engineering, algorithm selection, and model deployment methods.
Oh et al. [11] discussed methods, available to a Security Operation Center, for analyzing malicious network traffic encrypted with TLS. They mainly described machine-learning-based algorithms and encrypted traffic analysis techniques using middleboxes. They provided an overview of how TLS interception can be performed at a middlebox without a secret key, and of how a machine learning pipeline can passively inspect TLS-encrypted traffic by sniffing traffic and extracting features for analysis or by detecting malware using TLS flow fingerprinting.
Papadogiannaki and Ioannidis [12] described cryptographic network analysis applications, technologies, and countermeasures in four use cases. These use cases consist of analytics, security, user privacy, and middleboxes. Analytics identifies protocols and users, security detects malicious traffic, user privacy detects data leakage and fingerprinting, and middlebox performs a deep packet inspection for man-in-the-middle attacks. Each case provides an overview of the application, technology, and response measures adopted. The datasets used in each study are also mentioned.
However, the researchers did not specify or describe TLS fingerprinting, which can be utilized additionally or alone in the technology stack of network fingerprinting. Existing survey papers on TLS fingerprinting provide only partial information on its use as a cryptographic traffic analysis technique, insufficient for a broad understanding of TLS fingerprinting. This paper aims to provide a broad perspective on TLS fingerprinting techniques without being specific to the purpose, technique, or data type.
TLS is an encrypted communication protocol designed to prevent eavesdropping, tampering, and message forgery of client-server data by providing end-to-end encryption. TLS 1.0, based on Secure Sockets Layer (SSL) 3.0, was designed to provide communication privacy over the internet [17]. TLS 1.3 was published in August 2018, and TLS 1.0 and TLS 1.1 were formally deprecated by the Internet Engineering Task Force (IETF) in March 2021 [18]. TLS protects transmission data in two steps: servers and clients are authenticated using certificates and public key (asymmetric key) algorithms, and the encryption algorithm and keys are negotiated and exchanged so that the data can be encrypted after the negotiation [19,20,21,22]. The TLS structure is illustrated in Figure 3. It protects the data passed from application layer protocols (such as HTTP and FTP) to transport layer protocols (as the TCP or UDP payload). Traffic between communication peers is protected by the record protocol, while the handshake shares the encryption specifications and keys used to protect the transmission data.
Key exchange, server parameter setting, and authentication are the three stages of a handshake, as shown in Figure 4. The Client/ServerHello messages sent during the key exchange phase are not encrypted. A handshake encryption key is shared during this phase to protect the handshake messages that follow. The handshake encryption key is used until an application encryption key is shared in the server parameters and authentication phases. The application data are encrypted using N application encryption keys generated through key sharing. The Client/ServerHello messages, which are non-encrypted data, are used to generate fingerprints from their cipher suites and extensions.
A Markov chain is a stochastic model with the Markov property, in which the probability of a particular event depends only on the state attained in the previous event [23]. A stochastic process is a set of random variables whose state changes stochastically over a certain period. It can be viewed as a collection of values obtained by observing the state of an object over time. A stochastic process can be categorized as discrete-time or continuous-time depending on when the observations are made. A Markov chain generally refers to the discrete-time Markov process [24]. If $X_{n+1}$ is the next state, $X_n$ is the current state, and $X_{n-1},\ldots,X_0$ are the earlier states, then the Markov chain can be expressed as
$$P(X_{n+1}=j \mid X_n=i,\, X_{n-1}=i_{n-1},\, \ldots,\, X_0=i_0) = P(X_{n+1}=j \mid X_n=i). \tag{1}$$
If $P(X_{n+1}=j \mid X_n=i)$, which is defined for each pair $(i,j)$, is written as $p_{ij}$, then a Markov chain with $m$ states can be encoded as in Eq (2). The matrix is a two-dimensional transition probability matrix, where $i$, $j$, and $p_{ij}$ represent the row, the column, and the transition probability element of the matrix, respectively. Since the elements of row $i$ are the probabilities of moving from state $i$ to each of the $m$ states, each row always sums to 1.
$$\begin{pmatrix} p_{11} & \cdots & p_{1m} \\ \vdots & \ddots & \vdots \\ p_{m1} & \cdots & p_{mm} \end{pmatrix} \tag{2}$$
A Markov chain can also be represented by a state transition graph using a set of states and a transition probability matrix. For example, Figure 5 illustrates a Markov chain with $m=2$ states and, correspondingly, four transition probabilities $p_{ij}$.
Markov processes and chains are mathematical techniques for analyzing state changes and state characteristics by approaching complex probability processes with simple assumptions using a set of states and transition probability matrices [25,26]. The state transition graph of a Markov chain can be used to generate fingerprints based on changes in the message type and state of encrypted communication traffic.
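As a minimal sketch of the definitions above, the following Python snippet encodes a two-state transition probability matrix, checks that each row sums to 1, and propagates a state distribution by one step at a time. The matrix values and the function names are illustrative assumptions, not taken from any of the surveyed studies.

```python
# Hypothetical two-state chain (m = 2); the probabilities are made up for illustration.
P = [
    [0.9, 0.1],  # p_11, p_12: transitions out of state 1
    [0.5, 0.5],  # p_21, p_22: transitions out of state 2
]


def is_row_stochastic(matrix, tol=1e-9):
    """Return True if all entries are non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol
        for row in matrix
    )


def step(dist, matrix):
    """Advance a distribution over states by one transition: dist' = dist * P."""
    m = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(m)) for j in range(m)]


def n_steps(dist, matrix, n):
    """Apply `step` n times."""
    for _ in range(n):
        dist = step(dist, matrix)
    return dist


if __name__ == "__main__":
    print(is_row_stochastic(P))
    print(n_steps([1.0, 0.0], P, 2))
```

Starting in state 1, the chain is in state 1 after two steps with probability 0.9·0.9 + 0.1·0.5 = 0.86, which is what `n_steps` computes.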
AI is a field of study based on the conjecture that every aspect of learning and any other feature of intelligence can, in principle, be described so precisely that a machine can be made to simulate it [27]. AI, which aims to simulate intelligent human behavior, is widely used in fields such as data engineering and power demand forecasting [28]. In the form of machine learning systems, it aims at accurate prediction.
Machine learning extracts features from a large amount of data and trains models such as artificial neural networks on them to perform classification and regression on future inputs [29]. With the development of deep learning techniques, which use deep neural networks modeled on human neural networks, research in various fields and use cases has emerged [30]. Machine learning is used to analyze encrypted traffic, for example by extracting features from encrypted communication traffic, classifying benign and malicious software traffic, or identifying client applications and services.
A technique based on fingerprint collection generates fingerprints for encrypted traffic, stores them in a database, and performs fingerprinting by exact or approximate matching. Fingerprints are generated from Client/ServerHello messages, statistical techniques, and behavior such as the response to a particular sequence of request messages. They are represented by strings, data formats (XML, JSON), state transition graphs, etc. Because the technique relies on stored fingerprint information, accuracy is poor for fingerprints that are forged, altered, or unregistered. Research on improving accuracy using statistics and AI techniques is being conducted.
This technique generates fingerprints by extracting values from the ClientHello/ServerHello messages sent and received during the key exchange phase. It primarily identifies the process running on a client or the service provided by a server by comparing the generated fingerprint with a prebuilt fingerprint database.
Ristic [31] proposed a technique that distinguished client processes, taking advantage of the fact that the list of supported encryption algorithms sent to the server differed depending on the client using SSL. Subsequently, several studies have proposed fingerprint generation methods that use various ClientHello fields in addition to cipher suites [32,33,34,35]. The fields used are presented in Table 2.
ClientHello field | [31] | [32] | [33] | [34] | [35] |
SSL/TLS record version | √ | √ | |||
SSL/TLS version | √ | √ | √ | √ | |
Cipher suites length | √ | ||||
Cipher suites | √ | √ | √ | √ | √ |
Extensions | √ | √ | √ | √ | |
Data of extensions | √ | ||||
Flag | √ | ||||
Compression length | √ | ||||
Compression | √ | ||||
Server name* | √ | ||||
Elliptic curves* | √ | ||||
Elliptic curve point formats* | √ |
Cipher suites are commonly used for fingerprinting, and p0f additionally uses extensions for SSL fingerprinting [32]. Brotherston [33] demonstrated the identification of real-world processes by adding various fields. Althouse [34] discussed the JA3 fingerprint, which limits the fields used to keep fingerprint generation simple and adds extension and elliptic curve information to keep the fingerprint discriminative. The fingerprint generation code and method were released as open source to encourage adoption. The JA3 fingerprint is also used as pulse information in Open Threat Exchange, a threat intelligence-sharing system. An example of a JA3 fingerprint obtained using the ClientHello message is shown in Figure 6. Joy, a package by Cisco for capturing and analyzing network flow data, also uses TLS fingerprints [35]. Similar to the JA3 fingerprint, these fingerprints do not single out the contents of a specific extension, but the data of extensions can be included in the fingerprint. Although this technique has the advantages of fast identification and ease of use, fingerprints are generated only from ClientHello messages. Therefore, the technique has the disadvantage that several processes sending the same ClientHello message map to a single fingerprint.
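The JA3 construction can be sketched in a few lines of Python. The sketch follows the publicly documented JA3 layout (decimal field values joined by "-", the five fields joined by ",", then an MD5 digest) and drops GREASE values. It assumes the ClientHello has already been parsed into integer lists; the function names and the sample values in the usage note are illustrative and are not taken from Figure 6.

```python
import hashlib

# GREASE placeholder values 0x0A0A, 0x1A1A, ..., 0xFAFA are excluded from JA3.
GREASE = {0x0A0A + 0x1010 * k for k in range(16)}


def _join(values):
    """Join decimal values with '-' after dropping GREASE placeholders."""
    return "-".join(str(v) for v in values if v not in GREASE)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string:
    version,ciphers,extensions,elliptic_curves,ec_point_formats
    """
    return ",".join([
        str(version),
        _join(ciphers),
        _join(extensions),
        _join(curves),
        _join(point_formats),
    ])


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return the 32-character hexadecimal MD5 digest of the JA3 string."""
    s = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(s.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Hypothetical ClientHello values, for illustration only.
    print(ja3_string(771, [4865, 4866], [0, 10, 11], [29, 23], [0]))
    print(ja3_hash(771, [4865, 4866], [0, 10, 11], [29, 23], [0]))
```

Because the digest depends only on ClientHello fields, two different processes built on the same TLS library and configuration produce the same value, which is the collision problem noted above.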
Accuracy can be improved by utilizing additional techniques, such as machine learning, statistical techniques, and ServerHello information, to compensate for these shortcomings. ServerHello information helps because the ClientHello of a client may remain the same while the ServerHello of the servers it connects to differs between applications. JA3S uses the ServerHello to generate server fingerprints, while Joy uses the cipher suite selected by the server [35,36]. Combining JA3S with JA3 fingerprints provides additional information on connections, so connections that share the same ClientHello can be distinguished. Joy also allows connections to be distinguished, but it uses different fingerprinting information, such as other protocols and the operating system.
To improve ClientHello/ServerHello-based techniques, Anderson and McGrew [37] used knowledge bases combined with end host and network data to identify trends in enterprise TLS applications, which also enhanced the understanding of application behavior. Anderson and McGrew [38] presented a method to improve the identification and classification accuracy of ClientHello-based TLS fingerprints using destination context and pre-collected knowledge bases. In that study, when no exact match for the fingerprint information was found, a group of similar fingerprints was selected through approximate matching using the Levenshtein distance algorithm. The matching probability was then computed using a weighted naive Bayes model trained with the destination context and knowledge bases, and the most likely process was taken as the identification result.
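The approximate-matching step can be illustrated with a short Python sketch. It covers only the exact-match lookup and the Levenshtein fallback; the weighted naive Bayes stage is omitted. Representing a fingerprint as a tuple of cipher suite codes, the database contents, the labels, and the distance threshold are all hypothetical choices made for the example.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def match_candidates(query, database, max_distance=2):
    """Return (label, distance) pairs: the exact match if one exists,
    otherwise all stored fingerprints within max_distance, nearest first."""
    if query in database:
        return [(database[query], 0)]
    scored = [(label, levenshtein(query, fp)) for fp, label in database.items()]
    return sorted((item for item in scored if item[1] <= max_distance),
                  key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical fingerprint database keyed by cipher suite lists.
    db = {
        (4865, 4866, 4867): "process-a",
        (49195, 49199): "process-b",
    }
    print(match_candidates((4865, 4867), db, max_distance=1))
```

In a fuller pipeline, the candidate group returned here would be the input to a probabilistic model that weighs destination context.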
Korczyński and Duda [39] collected all handshake connections that occurred when accessing a particular service (Paypal, Twitter, Dropbox, etc.). Probabilities for the TLS protocol versions and the handshake message occurrences were derived, and each service was modeled as a Markov chain with these probabilities as parameters. Liu et al. [40] generated features by using length Markov models in addition to the message-type Markov models [39] and performed classification using machine-learning-based classifiers. Liu et al. [41] proposed a method to improve the classification accuracy of application traffic by performing fingerprinting using the cipher suite distribution as an additional attribute alongside the length Markov models [40]. Classification accuracy was evaluated against prior methods such as MaMPF [40], and improved accuracy was demonstrated. An example of a statistical fingerprint obtained using a Markov chain and state transition graph is shown in Figure 7.
Chao [42] conducted a study to identify malicious encrypted traffic. The fingerprint was generated by adding the TCP handshake, the SSL/TLS message types, and the TCP four-way connection termination as fingerprint elements to the SSL/TLS handshake state transition-based fingerprint [39]. Encrypted traffic was then identified by generating features based on a second-order Markov chain derived from the generated fingerprint and using them for machine learning.
Zhao et al. [43] classified and identified encrypted traffic by replicating pre-filtered traffic and analyzing its header and handshake characteristics. Fingerprints based on a hidden Markov model were extracted to classify and identify applications. In that study, classification was performed on web application, Real-time Transport Protocol, Voice over Internet Protocol, and video-audio streaming media traffic.
In 2019, Garn et al. [44] identified the web browser acting as the client application by having the server send a test set consisting of specific combinations of message sequences during the TLS handshake process. When the browser sends a ClientHello message to the server, the server, configured with the proposed framework, sends a specific combination of server-response messages, and the browser replies to them. The browser's responses during this handshake process are recorded as a feature vector that serves as its fingerprint. In 2022, Garn et al. [45] discussed a two-step method for identifying a browser: ClientHello-based fingerprinting was performed when the browser connected, and the browser was identified using the earlier method [44] when no unique match was found. An example of a behavioral fingerprint using a combination of sequences is shown in Figure 8.
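A simplified version of the message-type Markov fingerprint can be sketched as follows. One transition matrix per service is estimated from observed message-type sequences, and a new flow is assigned to the service whose chain gives it the highest log-likelihood. This is an illustration under stated assumptions rather than the method of [39,40,41]: entry and exit probabilities and length models are omitted, and the additive smoothing, the probability floor, the message labels, and the service names are hypothetical.

```python
import math
from collections import defaultdict


def fit_chain(sequences, smoothing=1.0):
    """Estimate first-order transition probabilities from message-type sequences."""
    states = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    chain = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        # Additive smoothing keeps every row a valid probability distribution.
        chain[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return chain


def log_likelihood(seq, chain, floor=1e-6):
    """Sum of log transition probabilities; unseen states get a small floor."""
    return sum(math.log(chain.get(a, {}).get(b, floor))
               for a, b in zip(seq, seq[1:]))


def classify(seq, chains):
    """Pick the service whose chain makes the observed sequence most likely."""
    return max(chains, key=lambda name: log_likelihood(seq, chains[name]))


if __name__ == "__main__":
    # Hypothetical training flows for two services.
    full = ["ClientHello", "ServerHello", "Certificate", "Finished"]
    short = ["ClientHello", "ServerHello", "Finished"]
    chains = {"service-a": fit_chain([full] * 5),
              "service-b": fit_chain([short] * 5)}
    print(classify(full, chains), classify(short, chains))
```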
AI-based techniques extract features from encrypted traffic and train artificial neural networks on them to perform fingerprinting. Feature extraction utilizes statistical, time series, graph, and hybrid (or miscellaneous) techniques. The statistical technique extracts features by applying statistical methods to data obtained from the traffic, while the time series technique extracts features from meaningful information arranged in chronological order among the data obtained from the traffic. The graph technique extracts features using a traffic-tracking graph. A hybrid technique combines several techniques and other features. In particular, these techniques can identify the transmitted content and the algorithm used for encrypting transmission data.
In 2017, Dubin et al. [46] proposed a machine learning method that extracts features from video streaming traffic and classifies the titles of videos uploaded to video platforms. The bit-per-peak feature, derived from the peak values in the download speed pattern during video streaming, is used. With this method, a third party can classify the video content transmitted through encrypted traffic. Yang et al. [47] extended the previous method [46] by fingerprinting fragments of encrypted video traffic with a Markov chain to determine which video was viewed. For videos with specific titles uploaded to YouTube, a video platform, the state transitions of the fragment sequence during streaming were modeled as a Markov chain. The title of the viewed YouTube video was then identified by a machine learning model trained on the modeled information.
Al-Naami et al. [48] extracted packet lengths and the sizes and durations of uplink and downlink bursts from packet headers and configured a feature set called bi-directional dependence fingerprinting, which they used to train machine learning models on the two-way traffic of mobile applications and websites.
In 2021, Kanda and Hashimoto [49] proposed a technique for identifying encryption algorithms and libraries from encrypted payloads, addressing the weakness that TLS fingerprinting can be bypassed by randomizing or modifying the handshake message parameters. Features were extracted based on the tests of NIST SP 800-22, "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications," and the TLS version, cipher suite, and encryption algorithm were predicted [50]. An example of AI-based behavioral fingerprinting using statistical features is shown in Figure 9.
Böttinger et al. [51] used the length of the TLS record protocol as a feature to train a machine learning model and generate a classifier that classified the format of the transmitted file. Classification was performed on application (ELF), voice (MP3), document (PDF), and image (JPG, WEBM) format files. In real scenarios, it was reported that fragment offset, uncertain compressor state, and symmetric block cipher padding acted as noise.
Zhang et al. [52] proposed features that could be used in techniques for fingerprinting websites encrypted with SSL/TLS and applied the features to a Deep Forest model. The proposed local request and response sequence feature was derived by grouping the sizes and numbers of incoming and outgoing packets according to various criteria. The developed model was then compared with other models in terms of F1 score.
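To make the idea of randomness-test features concrete, the sketch below implements the frequency (monobit) test, the first test in NIST SP 800-22, and returns its P-value for a bit sequence. Which tests and which derived values Kanda and Hashimoto actually used is not stated here, so treating this single P-value as a payload feature is an assumption for illustration; the function names are hypothetical.

```python
import math


def bytes_to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def monobit_p_value(bits):
    """Frequency (monobit) test from NIST SP 800-22.

    Maps bits to +1/-1, sums them, normalizes by sqrt(n), and returns
    erfc(s_obs / sqrt(2)). Small P-values indicate a biased sequence.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("at least one bit is required")
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


if __name__ == "__main__":
    # Ten-bit worked example from the NIST document (sum of +1/-1 values is 2).
    print(monobit_p_value([1, 0, 1, 1, 0, 1, 0, 1, 0, 1]))
    # A payload of all-zero bytes is maximally biased.
    print(monobit_p_value(bytes_to_bits(b"\x00" * 16)))
```

A feature vector for a classifier could be built by applying several such tests to each encrypted payload and collecting the resulting statistics.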
Lu et al. [53] used traffic-tracking graphs to fingerprint websites. The traffic-tracking graph was generated by dividing a packet into flows, by referring to a collection of packets with the same IP, and by using the packet length, time interval, directional sequence, and timestamp of the first packet. Subsequently, website classification was performed using the proposed graph-based machine learning model. A system overview of the graph attention pooling network-website fingerprinting (GAP-WF) is shown in Figure 10.
Richter et al. [54] presented four methods that use TLS fragmentation and the HMAC size changes that depend on the selected cipher suite. Five application layer protocols (namely, Hypertext Transfer Protocol, Simple Mail Transfer Protocol, Internet Relay Chat, Post Office Protocol 3, and Internet Message Access Protocol) were classified using Bayesian classifiers. The TLS fragmentation, analyzed traffic structures, packet length, and inter-arrival time used as features are shown in Figure 11. The methods differ in the features they extract.
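For readers unfamiliar with Bayesian classification on traffic features, the following is a minimal Gaussian naive Bayes sketch over two features, packet length and inter-arrival time. It is one common form of Bayesian classifier and is not claimed to be the classifier of [54]; the training values are synthetic and the names `fit` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def fit(samples):
    """samples: list of (label, [feature values]). Returns a per-label model."""
    by_label = defaultdict(list)
    for label, features in samples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant avoids zero variance on tiny training sets.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-6
            stats.append((mean, var))
        model[label] = (math.log(len(rows) / len(samples)), stats)
    return model


def predict(model, features):
    """Return the label with the highest log posterior."""
    def score(label):
        log_prior, stats = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
            for x, (mean, var) in zip(features, stats)
        )
    return max(model, key=score)


if __name__ == "__main__":
    # Synthetic (packet length in bytes, inter-arrival time in seconds) pairs.
    training = [
        ("HTTP", [1200, 0.01]), ("HTTP", [1300, 0.02]),
        ("IRC", [80, 1.0]), ("IRC", [100, 1.5]),
    ]
    model = fit(training)
    print(predict(model, [1250, 0.015]), predict(model, [90, 1.2]))
```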
Anderson et al. [55,56,57] attempted to increase the accuracy of malware classification using various features. A dataset with the TLS version, offered cipher suites, TLS extensions, selected cipher suite, client key length, and the sequence of record lengths, times, and types as features was constructed, and malware detection was performed [55]. Malicious traffic can be classified using the sequence of packet lengths and times (SPLT) and the byte distribution (BD), together with TLS, HTTP, and DNS metadata, as features [56]. The metadata for each protocol were extracted from the TLS handshake message, HTTP header, and DNS response without decryption. Malware detection was performed by training an L1-regularized logistic regression model. Malicious traffic can also be classified using SPLT, BD, and TLS metadata, along with a custom feature indicating whether the server certificate was self-signed [57].
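A minimal sketch of the two flow-level features named above is given below. The exact encodings used in [56] are not reproduced here; the truncation to the first `max_packets` packets, the 256-bin normalized histogram, and the function names are assumptions for illustration.

```python
def splt_features(packets, max_packets=10):
    """Sequence of packet lengths and inter-arrival times.

    packets: list of (timestamp, length) tuples in arrival order.
    Returns (lengths, gaps) for the first max_packets packets.
    """
    head = packets[:max_packets]
    lengths = [length for _, length in head]
    gaps = [cur[0] - prev[0] for prev, cur in zip(head, head[1:])]
    return lengths, gaps


def byte_distribution(payload: bytes):
    """Relative frequency of each of the 256 byte values in payload."""
    counts = [0] * 256
    for byte in payload:
        counts[byte] += 1
    total = len(payload)
    return [count / total for count in counts] if total else [0.0] * 256


if __name__ == "__main__":
    print(splt_features([(0.0, 100), (1.0, 200), (3.0, 50)], max_packets=2))
    print(byte_distribution(b"\x00\x00\xff\x01")[:2])
```

Both outputs are fixed-length or easily padded vectors, which makes them convenient inputs for a linear model such as logistic regression.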
Based on the techniques discussed so far, a step-by-step TLS fingerprinting technique is presented in Figure 12.
First, in encrypted traffic fingerprinting, the TLS packets exchanged between the client and server are analyzed. When the packets are ordered by time, the exchange can be divided into the key exchange, server parameter and authentication, and application data phases, which are termed Phases 1, 2, and 3, respectively, for convenience. Phase 1 can be understood as the connection establishment attempt phase, Phase 2 as the connection establishment phase, and Phase 3 as the connected phase.
In Phase 1, pre-connection analysis is performed based on the Client/ServerHello fingerprints. In Phase 2, analysis is performed during the connection setup (handshake) using the metadata of the entire handshake phase. In Phase 3, analysis is performed while connected. Phase 1 has the advantage of quick analysis and pre-detection, but it is vulnerable to handshake modulation and is relatively less accurate because the analysis relies on limited information. Phase 2 draws on more information and can therefore detect with higher accuracy, but it may still be vulnerable to handshake modulation. It also has the disadvantage of being unable to detect content (e.g., insider information leakage) transmitted after the connection is established. In Phase 3, the information extracted from the application data is used to detect the transmitted content based on the connection information that was analyzed up to Phase 2.
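As an illustration of a Phase 1 fingerprint, the sketch below builds a JA3-style string from ClientHello fields (version, cipher suites, extensions, elliptic curves, and point formats, written as decimal values joined by "-" within a field and "," between fields) and hashes it with MD5. JA3 is used here only as a familiar example and is not presented as the method of this study; the function name is hypothetical, and unlike real JA3 implementations the sketch does not filter GREASE values.

```python
import hashlib


def clienthello_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Return (fingerprint string, MD5 hex digest) for ClientHello fields."""
    fields = [str(version)]
    for values in (ciphers, extensions, curves, point_formats):
        fields.append("-".join(str(value) for value in values))
    text = ",".join(fields)
    return text, hashlib.md5(text.encode("ascii")).hexdigest()


if __name__ == "__main__":
    text, digest = clienthello_fingerprint(
        771, [4865, 4866], [0, 10, 11], [29, 23], [0]
    )
    print(text)
    print(digest)
```

Because the digest depends on the order and content of the offered values, a client that reorders or randomizes them produces a different fingerprint, which is the handshake modulation weakness noted for Phase 1.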
Detailed fingerprinting can be performed using a greater number of phases, but significant time and computing resources are consumed as more information is used. Therefore, applying the phases step by step according to policy will increase the efficiency of the fingerprinting process. For example, the phases to apply may be configured differently depending on the importance and type of asset. For systems where identifying "who approaches" is important, the client should be analyzed using Phases 1 and 2, and for important assets whose leakage can cause serious damage, all three phases should be used to identify content and detect insider information leakage.
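The policy-driven selection of phases described above could be expressed as a simple lookup. The policy names and the fallback to Phase 1 for unlisted policies are assumptions made for this sketch and are not prescribed by the text.

```python
# Hypothetical policy names mapped to the phases applied to matching assets.
PHASES_BY_POLICY = {
    "identify_client": (1, 2),     # "who approaches" matters
    "protect_content": (1, 2, 3),  # leakage would cause serious damage
}

# Assumed fallback: the cheapest analysis only.
DEFAULT_PHASES = (1,)


def phases_for(policy: str):
    """Return the fingerprinting phases to run for an asset policy."""
    return PHASES_BY_POLICY.get(policy, DEFAULT_PHASES)


if __name__ == "__main__":
    for policy in ("identify_client", "protect_content", "unclassified"):
        print(policy, phases_for(policy))
```

Keeping the mapping in data rather than code lets an operator adjust the trade-off between cost and detail per asset class without changing the analysis pipeline.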
In this study, encrypted traffic fingerprinting is analyzed by dividing it into fingerprint collection and AI techniques. Several advantages and disadvantages for each technique are identified.
Fingerprint collection techniques have the advantage of easy system construction when the fingerprint generation methods are clear. In addition, identification and classification proceed through comparison with a fingerprint database, which consumes less time and fewer computing resources. With these techniques, pre-detection is possible because identification and classification can proceed during the handshake process. However, their accuracy is poor when fingerprints are not registered in the fingerprint database or when the client and server falsify and modulate the information used for fingerprint generation.
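The database comparison, and the way it fails for unregistered or modulated fingerprints, can be sketched as a dictionary lookup. The hashing scheme, the field values, and the label "example-client" are all hypothetical.

```python
import hashlib


def fingerprint(fields):
    """Hash an ordered sequence of handshake field values (assumed format)."""
    text = ",".join(str(field) for field in fields)
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def lookup(database, fields):
    """Return the registered label, or "unknown" when there is no match."""
    return database.get(fingerprint(fields), "unknown")


# A toy database with a single registered client.
known_fields = (771, 4865, 4866, 49195)
database = {fingerprint(known_fields): "example-client"}

if __name__ == "__main__":
    print(lookup(database, known_fields))
    # Swapping two offered values changes the hash, so the lookup misses.
    print(lookup(database, (771, 4866, 4865, 49195)))
```

The lookup itself is a constant-time operation, which reflects the low cost of this family of techniques, while the second call shows why exact-match databases are sensitive to modulation.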
AI techniques can perform calculations using various features and can infer the encrypted and transmitted content along with the encryption algorithms used. This is advantageous because leaked content can be identified and the detection result can report whether information was leaked. However, detection in advance is difficult because these techniques either target the application layer or must collect information on the entire traffic of a connection before extracting features and producing results.
As shown in Figure 12, future work should develop fingerprinting that can be performed step by step, so that the advantages of each technique offset the disadvantages of the other.
This work was supported by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No.2022-0-01200, Training Key Talents in Industrial Convergence Security).
The authors declare no conflict of interest.
[1] | L. Christian, World Health Organization Statistics 2015, 2015. Available from: https://apps.who.int/mediacentre/news/releases/2015/world-health-statistics-2015/fr/index.html. |
[2] | M. Kane, Global programme for control of hepatitis B infection, Vaccine, 13 (1995), S47–S49. https://doi.org/10.1016/0264-410X(95)80050-N |
[3] | A. A. Fall, Études de quelques modèles épidémiologiques: application à la transmission du virus de l'hépatite B en Afrique subsaharienne (cas du Sénégal), Ph.D thesis, Paul Verlaine-Metz University/Universite Gaston Berger, 2010. |
[4] | N. Otric, Cameroon Health Among the 17 Countries Most Affected by Hepatitis, 2017. Available from: https://www.cameroon-info.net/article/cameroun-sante-le-cameroun-parmi-les-17-pays-les-plus-touches-par-lhepatite-selon-296589.html. |
[5] | G. M. Prifti, D. Moianos, E. Giannakopoulou, V. Pardali, J. E. Tavis, G. Zoidis, Recent advances in hepatitis B treatment, Pharmaceuticals, 14 (2021), 417. https://doi.org/10.3390/ph14050417 |
[6] | C. W. Shepard, E. P. Simard, L. Finelli, A. E. Fiore, B. P. Bell, Hepatitis B virus infection: epidemiology and vaccination, Epidemiol. Rev., 28 (2006), 112–125. https://doi.org/10.1093/epirev/mxj009 |
[7] | M. Matshidiso, Cameroon–The Government Aims to Lower to 53 the Prevalence Rate of Hepatitis B, 2020. Available from: https://actucameroun.com/2020/09/04/cameroun-le-gouvernement-ambitionne-de-baisser-a-53-le-taux-de-prevalence-de-lhepatite-b/. |
[8] | B. A. Collins, D. Meenakshi, O. Sakuya, 91 million Africans infected with hepatitis B or C, 2022. Available from: https://www.afro.who.int/fr/news/91-millions-dafricains-infectes-par-lhepatite-b-ou-c. |
[9] | W. Garira, B. Maregere, The transmission mechanism theory of disease dynamics: Its aims, assumptions and limitations, Infect. Dis. Model., 8 (2023), 122–144. https://doi.org/10.1016/j.idm.2022.12.001 |
[10] | W. Garira, K. Muzhinji, Application of the replication–transmission relativity theory in the development of multiscale models of infectious disease dynamics, J. Biol. Dyn., 17 (2023), 2255066. https://doi.org/10.1080/17513758.2023.2255066 |
[11] | W. Garira, The replication-transmission relativity theory for multiscale modelling of infectious disease systems, Sci. Rep., 9 (2019), 16353. https://doi.org/10.1038/s41598-019-52820-3 |
[12] | W. Garira, The research and development process for multiscale models of infectious disease systems, PLoS Comput. Biol., 16 (2020), e1007734. https://doi.org/10.1371/journal.pcbi.1007734 |
[13] | M. A. Novak, S. Bonhoeffer, A. M. Hill, R. Boehme, H. C. Thomas, H. McDade, Viral dynamics in hepatitis B virus infection, Proc. Natl. Acad. Sci. USA, 93 (1996), 4398–4402. https://doi.org/10.1073/pnas.93.9.4398 |
[14] | A. V. Herz, S. Bonhoeffer, R. M. Anderson, R. M. May, M. A. Nowak, Viral dynamics in vivo: limitations on estimates of intracellular delay and virus decay, Proc. Natl. Acad. Sci. USA, 93 (1996), 7247–7251. https://doi.org/10.1073/pnas.93.14.7247 |
[15] | S. Zeuzem, A. Robert, P. Honkoop, W. K. Roth, S. W. Schalm, J. M. Schmidt, Dynamics of hepatitis B virus infection in vivo, J. Hepatol., 27 (1997), 431–436. https://doi.org/10.1016/S0168-8278(97)80345-5 |
[16] | G. K. Lau, M. Tsiang, J. Hou, S. T. Yuen, W. F. Carman, L. Zhang, et al., Combination therapy with lamivudine and famciclovir for chronic hepatitis B–infected Chinese patients: a viral dynamics study, Hepatology, 32 (2000), 394–399. https://doi.org/10.1053/jhep.2000.9143 |
[17] | S. R. Lewin, R. M. Ribeiro, T. Walters, G. K. Lau, S. Bowden, S. Locarnini, et al., Analysis of hepatitis B viral load decline under potent therapy: complex decay profiles observed, Hepatology, 34 (2001), 1012–1020. https://doi.org/10.1053/jhep.2001.28509 |
[18] | N. Moolla, M. Kew, P. Arbuthnot, Regulatory elements of hepatitis B virus transcription, J. Viral Hepatitis, 9 (2002), 323–331. https://doi.org/10.1046/j.1365-2893.2002.00381.x |
[19] | V. Bruss, Hepatitis B virus morphogenesis, World J. Gastroenterol., 13 (2007), 65. |
[20] | J. Nakabayashi, A. Sasaki, A mathematical model of the intracellular replication and within host evolution of hepatitis type B virus: Understanding the long time course of chronic hepatitis, J. Theor. Biol., 269 (2011), 318–329. https://doi.org/10.1016/j.jtbi.2010.10.024 |
[21] | J. Nakabayashi, The intracellular dynamics of hepatitis B virus (HBV) replication with reproduced virion "re-cycling", J. Theor. Biol., 396 (2016), 154–162. http://dx.doi.org/10.1016/j.jtbi.2016.02.008 |
[22] | W. Garira, A complete categorization of multiscale models of infectious disease systems, J. Biol. Dyn., 11 (2017), 378–435. https://doi.org/10.1080/17513758.2017.1367849 |
[23] | W. Garira, A primer on multiscale modelling of infectious disease systems, Infect. Dis. Model., 3 (2018), 176–191. https://doi.org/10.1016/j.idm.2018.09.005 |
[24] | W. Garira, D. Mathebula, A coupled multiscale model to guide malaria control and elimination, J. Theor. Biol., 475 (2019), 34–59. https://doi.org/10.1016/j.jtbi.2019.05.011 |
[25] | W. Garira, M. C. Mafunda, From individual health to community health: towards multiscale modeling of directly transmitted infectious disease systems, J. Biol. Syst., 27 (2019), 131–166. https://doi.org/10.1142/S0218339019500074 |
[26] | D. C. Krakauer, N. L. Komarova, Levels of selection in positive-strand virus dynamics, J. Evolution. Biol., 16 (2003), 64–73. https://doi.org/10.1046/j.1420-9101.2003.00481.x |
[27] | R. Netshikweta, W. Garira, A nested multiscale model to study paratuberculosis in ruminants, Front. Appl. Math. Stat., 8 (2022), 817060. https://doi.org/10.3389/fams.2022.817060 |
[28] | R. Netshikweta, W. Garira, An embedded multiscale modelling to guide control and elimination of paratuberculosis in ruminants, Comput. Math. Methods M., 2021 (2021), 9919700. https://doi.org/10.1155/2021/9919700 |
[29] | L. Rong, J. Guedj, H. Dahari, D. J. Coffield Jr, M. Levi, P. Smith, et al., Analysis of hepatitis C virus decline during treatment with the protease inhibitor danoprevir using a multiscale model, PLoS Comput. Biol., 9 (2013), e1002959. https://doi.org/10.1371/journal.pcbi.1002959 |
[30] | I. Hosseini, F. Mac Gabhann, Multi-scale modeling of HIV infection in vitro and APOBEC3G-based anti-retroviral therapy, PLoS Comput. Biol., 8 (2012), e1002371. https://doi.org/10.1371/journal.pcbi.1002371 |
[31] | G. W. Suryawanshi, A. Hoffmann, A multi-scale mathematical modeling framework to investigate anti-viral therapeutic opportunities in targeting HIV-1 accessory proteins, J. Theor. Biol., 386 (2015), 89–104. https://doi.org/10.1016/j.jtbi.2015.08.032 |
[32] | J. Guedj, A. U. Neumann, Understanding hepatitis C viral dynamics with direct-acting antiviral agents due to the interplay between intracellular replication and cellular infection dynamics, J. Theor. Biol., 267 (2010), 330–340. https://doi.org/10.1016/j.jtbi.2010.08.036 |
[33] | E. L. Haseltine, J. B. Rawlings, J. Yin, Dynamics of viral infections: incorporating both the intracellular and extracellular levels, Comput. Chem. Eng., 29 (2005), 675–686. https://doi.org/10.1016/j.compchemeng.2004.08.022 |
[34] | P. Van den Driessche, J. Watmough, Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission, Math. Biosci., 180 (2002), 29–48. https://doi.org/10.1016/S0025-5564(02)00108-6 |
[35] | P. Van den Driessche, J. Watmough, Further notes on the basic reproduction number, in Mathematical Epidemiology, Springer, (2008), 159–178. https://doi.org/10.1007/978-3-540-78911-6_6 |
[36] | M. M. Ojo, F. O. Akinpelu, Lyapunov functions and global properties of seir epidemic model, Int. J. Chem. Math. Phys., 1 (2017), 11–16. |
[37] | W. Garira, K. Muzhinji, The universal theory for multiscale modelling of infectious disease dynamics, Mathematics, 11 (2023), 3874. https://doi.org/10.3390/math11183874 |
ClientHello field | [31] | [32] | [33] | [34] | [35] |
SSL/TLS record version | √ | √ | |||
SSL/TLS version | √ | √ | √ | √ | |
Cipher suites length | √ | ||||
Cipher suites | √ | √ | √ | √ | √ |
Extensions | √ | √ | √ | √ | |
Data of extensions | √ | ||||
Flag | √ | ||||
Compression length | √ | ||||
Compression | √ | ||||
Server name* | √ | ||||
Elliptic curves* | √ | ||||
Elliptic curve point formats* | √ |
Survey | Problem domains | Method | Protocols |
[9] | Classification | Focus on ML-based | Various |
[10] | Classification | ML-based | Various |
[11] | Detection | Focus on ML-based | TLS |
[12] | Analysis | Various | Various |
Present study | Fingerprinting | Various | TLS |