
Pandemics caused by infectious diseases are a constant threat in our globalized society. There are seasonal diseases like influenza, but also new diseases, often of zoonotic origin, such as the recent case of COVID-19, caused by the coronavirus SARS-CoV-2. COVID-19 became a pandemic in 2020 and has left an important legacy in the form of extensive data covering many aspects relevant to the diffusion of the infection. During the COVID-19 pandemic, traditional compartmental modeling of infections, based on ordinary differential equations (ODEs), was employed very successfully to describe the evolution of the incidence of infections. More advanced versions of the traditional models have been proposed and tested, taking advantage of the unprecedented data availability. Data on people's mobility and behavior, on the adoption of public measures, and on economic restrictions and public policy were also fundamental in making sense of the pandemic as it happened. In hindsight, one of the major open questions on the topic of infectious disease spread is how best to use the available data in transmission models so that new insights can be uncovered and new lessons can be drawn from our common recent COVID-19 experience, should we again face similar situations.
Compartmental models are based on pioneering work in the early 20th century [1]. The first models were based on three compartments: susceptible (S), infectious (I) and removed (R), leading to the famous SIR model. More recently, the compartment of exposed (E) has been added to account for the latency of the disease, leading to SEIR models. Such models are flexible enough to allow for interactions among separate geographical regions or age stratification. In the former case, each region has its own set of SEIR-type ODEs with connecting terms to reflect the importation of cases from other regions. From the administrative point of view, the regional division of a population (for example, counties in the USA, or public health regions in the province of Ontario, Canada) is based on various socio-demographic criteria, history, etc. Established literature in disease transmission has looked at regions as cities, with commuter traffic and/or regular travel between them [2,3,4].
In general, SEIR-type models work well for large, well-mixed, and isolated populations within which infections can easily spread [5]. This requirement is often not respected by administrative regional divisions. For instance, in the USA, counties are often sparsely populated or strongly connected to nearby counties by commuting populations; states, instead, may encompass disconnected local areas or feature substantial cross-state commuting populations. The main goal of our work is to recast small geographical units, like counties, into new well-mixed and isolated regions using available data on people's mobility. The new regions are defined by the following criteria: minimizing mobility between the new regions while creating well-mixed sub-populations in each such new region. Moreover, we introduce a notion of temporal stability of our clusters and use it to analyze the results over a 6-month time window. Clustering of populations is not a novel concept; researchers have clustered populations in order to improve government legislation, re-imagine municipal infrastructure, and observe the environmental impacts of carbon emissions [6,7]. That work is broadly focused on using individual mobility data along with other features to cluster populations based on the similarity of those features [8,9]. Additional research has been done to discover high-traffic areas within cities in order to understand traffic dynamics within regional populations [10]. Research into clustering regions based on interconnected mobility is, to our knowledge, less represented in the literature.
Nevertheless, mobility networks have been used in conjunction with SEIR-type models in order to capture the epidemiological dynamics of COVID-19 in urban populations [11,12,13]. Some of this research focused on individual city dynamics in order to capture the spread within these smaller regions [11], while other studies quantified the effect of mobility restrictions on the disease spread in Canada [12] and in other countries around the world [13]. At larger scales, mobility can be used to understand the spread of the disease across continents: for instance, the case of the second wave in 2020 in Europe has been studied in [14], while the spread in the USA has been modeled at the census division level [15]. Many other cases have been studied in the literature that cannot be briefly summarized here, all struggling with the need to find an appropriate sub-population characterization.
For our purpose, we will employ a simple machine learning approach to define the new regions, based on the criteria and datasets mentioned above. There are three general approaches to classification problems: supervised, semi-supervised, and unsupervised [16,17]. For the purposes of this article, we focus on unsupervised classification, herein referred to as clustering [17]. In clustering, no information is known about the true classification of the data, unlike in supervised and semi-supervised learning, where some or all of the information about the true classification is available [17]. Clustering assigns classes to objects in a dataset [17]. Many different clustering algorithms exist, with a broad scope of applications, although no single clustering method is superior in all situations [17]. Clustering algorithms vary in methodology and applications, as described elsewhere [16,17,18]. For the purposes of this research, our methodology most resembles a type of fuzzy clustering, with distinct differences. Fuzzy clustering algorithms assign to each object a probability of belonging to each cluster [18]. Since the membership of a data point is shared among all clusters, the boundaries of the clusters become fuzzy [16]. In general, these approaches aim to minimize a cost function and reach some local minimum [16]. Our algorithm likewise minimizes a cost function and defines the membership of each node in each cluster as a probability; the difference is that our algorithm uses the maximum probability to define crisp clusters after training.
The structure of the paper is as follows: In Section 2 we introduce the clustering algorithm adopted in this work; in Section 3 we apply the algorithm to the USA counties in early 2020, focusing on the stability of the clustering under the adoption of uneven measures; additionally, in Section 3 we discuss the epidemiological implications of adopting the novel sub-populations; we finally offer our conclusions in Section 4.
Part of the data that support these findings (USA COVID surveillance) is publicly available through the New York Times [19]. The other part (USA contact rates) is available from the data vendor Cuebiq; restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. Data are, however, available from the authors upon reasonable request and with the permission of Cuebiq. The mobility data from Cuebiq was provided as weekly snapshots from January to June 2020. Several counties had missing mobility data; a list of these counties can be found in Table A1 of the supplementary material. The Google mobility data and the Mobility Census data from Ontario are publicly available [20,21].
Gradient descent learning was first proposed by Augustin-Louis Cauchy in 1847 [22] as an optimization algorithm suited to solving systems of coupled differential equations. Gradient descent algorithms minimize some objective function with respect to its variables [23]. This is done by calculating the gradient of the objective function and taking a small step in the opposite (decreasing) direction of the gradient. The step size is controlled by a learning rate, which is, in general, rather small. Recently, these algorithms have been used to optimize neural networks [23].
Here, we apply this class of algorithm to a network formed by small geographical units (i.e., counties) that are connected by people's mobility among them. The main goal is to define a set of macro-regions formed by clusters of the network nodes, which are maximally connected inside each cluster and minimally connected to other clusters.
We first define an arbitrary number of clusters, denoted by $ N_\text{clusters} $, under the assumption that this number is much smaller than the number of nodes in the network, $ N_\text{clusters}\ll N_\text{nodes} $. Clusters can be initialized randomly or using some intuitive initialization. The algorithm then reorganizes the nodes into the desired number of clusters using spatial mobility data over a given time horizon. The clusters have no fixed size: a cluster can range from containing no nodes to containing all nodes, although in practice a cluster never contains a large portion of the total number of nodes. We define a matrix of probabilities
$ P \in \mathbb{R}^{N_\text{nodes}\times N_\text{clusters}}, $ (2.1)
where each element $ P_{ic} \in [0, 1] $ denotes the probability that the node $ i $ belongs to cluster $ c $. Each row is bounded by conservation of probabilities to respect the following sum-rule:
$ \sum_{c=1}^{N_\text{clusters}} P_{ic} = 1. $ (2.2)
The clustering algorithm must find the optimal probability matrix $ P $ by minimizing a loss function, which is defined based on the desired properties of the clusters.
By using continuous probabilities instead of Boolean assignments, we can define a differentiable objective function that can be optimized with gradient descent: successive small improvements are made to the assignments, rather than abrupt changes when nodes are reassigned. We then deterministically assign each node to its maximum-probability cluster to obtain the final node-to-cluster assignment. Using batch gradient descent, convergence is guaranteed on both convex and non-convex surfaces [23].
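As a minimal sketch of the final hard-assignment step (plain Python with a toy probability matrix of our own; this is not the authors' code):

```python
def hard_assignment(P):
    """Assign each node to its maximum-probability cluster.

    P is a list of rows, one per node; each row holds that node's
    membership probabilities over the clusters.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in P]

# Two nodes, three clusters: node 0 favors cluster 1, node 1 favors cluster 2.
P = [[0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]
print(hard_assignment(P))  # [1, 2]
```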
The optimization of the node-to-cluster assignment is based on a loss function that depends on the probability matrix $ P $. It measures how accurately any value of $ P $ matches the required features of the clusters. This loss function should evaluate to a large value when we have an inaccurate solution, but close to zero when we have an accurate node-to-cluster assignment solution.
To fulfill our purposes, the loss function must have two parts, as we need to regulate two important measures: low mobility interactions among clusters and low population differences among clusters. Hence, we define the loss function as a convex combination of two terms:
$ \text{Loss}_\text{Total} = \alpha_\text{Int}\, \text{Loss}_\text{Int} + \alpha_\text{Pop}\, \text{Loss}_\text{Pop}, $ (2.3)
where the constants $ \alpha_{\text{Int}}, \alpha_{\text{Pop}} \in \mathbb{R}^+ $ control the relative strength of the two requirements. Once the convex weights are fixed, we employ a gradient descent method to minimize the loss function and find the optimal probability matrix $ P $. The optimal $ \alpha_{\text{Int}}, \alpha_{\text{Pop}} $ were determined by analyzing the clustering results over a range of values of $ \alpha_{\text{Int}}, \alpha_{\text{Pop}} \in \mathbb{R}^+ $ with $ \alpha_{\text{Int}} + \alpha_{\text{Pop}} = 1 $. This analysis can be found in the Appendix.
The low mobility interaction $ \text{Loss}_{\text{Int}} $ takes into account the interactions between clusters, measured in terms of the population mobility among the nodes of the network. Hence, this measure relies on mobility data, expressed in terms of an interaction matrix, $ \text{Interaction}_{ij} $, whose elements are proportional to people's flow from node $ i $ to node $ j $. The loss function sums the interactions among all nodes belonging to different clusters, and it is defined as follows:
$ \text{Loss}_\text{Int} := \sum_{i,j=1}^{N_\text{nodes}} \text{Interaction}_{ij}\, \mathbb{P}_\text{different}(i,j), $ (2.4)
with $ \mathbb{P}_\text{different}(i, j) $ being the probability of node $ i $ to be in a different cluster than node $ j $. By the definition of the probability matrix $ P $, we have
$ \mathbb{P}_\text{different}(i,j) = 1 - [P P^T]_{ij}, $ (2.5)
hence the loss function can be written in terms of matrix operations as:
$ \text{Loss}_\text{Int} = \text{Tr}\big(\text{Interaction}\, (\mathbf{1} - P P^T)\big), $ (2.6)
with $ \mathbf{1} $ being defined as a matrix of ones.
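The equivalence between the double sum of Eqs (2.4)–(2.5) and the trace form of Eq (2.6) can be checked numerically with a small sketch (plain Python; the toy interaction matrix and probabilities are our own illustration, not data from the paper):

```python
def loss_int_sum(interaction, P):
    """Direct double sum of Eq (2.4): interactions weighted by the
    probability that nodes i and j sit in different clusters."""
    n, k = len(interaction), len(P[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            ppt = sum(P[i][c] * P[j][c] for c in range(k))  # [P P^T]_{ij}
            total += interaction[i][j] * (1.0 - ppt)
    return total

def loss_int_trace(interaction, P):
    """Trace form of Eq (2.6): Tr(Interaction (1 - P P^T))."""
    n, k = len(interaction), len(P[0])
    # Tr(A B) = sum_{i,j} A_{ij} B_{ji}; here B = 1 - P P^T is symmetric.
    return sum(interaction[i][j] *
               (1.0 - sum(P[j][c] * P[i][c] for c in range(k)))
               for i in range(n) for j in range(n))

# Toy 3-node network with 2 clusters: nodes 0 and 1 interact strongly.
interaction = [[0.0, 5.0, 0.1],
               [5.0, 0.0, 0.2],
               [0.1, 0.2, 0.0]]
P = [[0.9, 0.1],
     [0.8, 0.2],
     [0.1, 0.9]]
assert abs(loss_int_sum(interaction, P) - loss_int_trace(interaction, P)) < 1e-9
```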
The low population difference term $ \text{Loss}_{\text{Pop}} $ forces the solution to contain clusters of approximately equal population. The main purpose of this term is to steer the algorithm away from the trivial solution where all nodes are joined in a single giant cluster while the other clusters are left empty, which trivially minimizes the inter-cluster interactions. To exclude this solution, we define the loss function as follows:
$ \text{Loss}_\text{Pop} := \sum_{c=1}^{N_\text{clusters}} \left( \mathbb{E}_P[\text{Population of cluster } c] - \frac{\text{TotalPop}}{N_\text{clusters}} \right)^2, $ (2.7)
where $ \mathbb{E}_P $ denotes the expected population of cluster $ c $ under the node assignment given by the probabilities $ P_{ic} $, and TotalPop is the total population in the network. Defining a vector of the node populations $ \text{Pop}_i $, we have
$ \mathbb{E}_P[\text{Population of cluster } c] = \sum_{i=1}^{N_\text{nodes}} P_{ic}\, \text{Pop}_i = (P^T \text{Pop})_c. $ (2.8)
Hence, the loss function can be written in terms of matrix operations as:
$ \text{Loss}_\text{Pop} = \left| P^T \text{Pop} - \frac{\text{TotalPop}}{N_\text{clusters}} \right|^2. $ (2.9)
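A small sketch of the population term (plain Python; the toy populations are our own) confirms that a perfectly balanced soft assignment yields zero loss, while the trivial single-cluster solution is penalized:

```python
def loss_pop(P, pop):
    """Eq (2.7)/(2.9): squared deviation of each cluster's expected
    population from the even split TotalPop / N_clusters."""
    n_clusters = len(P[0])
    target = sum(pop) / n_clusters
    loss = 0.0
    for c in range(n_clusters):
        # (P^T Pop)_c, the expected population of cluster c (Eq (2.8)).
        expected = sum(P[i][c] * pop[i] for i in range(len(pop)))
        loss += (expected - target) ** 2
    return loss

pop = [100.0, 200.0, 300.0]
# A perfectly balanced soft assignment over 2 clusters gives zero loss.
assert loss_pop([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]], pop) == 0.0
# Piling every node into one cluster is penalized.
assert loss_pop([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]], pop) > 0.0
```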
There is a subtle aspect in applying a gradient descent algorithm to our problem. Since the variables must respect $ P_{ic} \in [0, 1] $ and $ \sum\nolimits_{c = 1}^{N_{\text{clusters}}}P_{ic} = 1 $, plain gradient descent is not ideal, as it would apply to a constrained optimization problem with both box and linear constraints. Hence, to simplify the implementation, we apply the algorithm to a parametric matrix $ X \in \mathbb{R}^{N_{\text{nodes}}\times N_{\text{clusters}}} $, where each entry $ X_{ic} \in (-\infty, \infty) $ is unconstrained. In order to obtain $ P $ from $ X $, the real-valued rows must be converted into probabilities. A common approach in neural networks and deep learning is to use the softmax function [24,25]. Each row of the matrix $ P $ is then defined as follows:
$ P_{ic} = \frac{e^{X_{ic}}}{\sum_{c'=1}^{N_\text{clusters}} e^{X_{ic'}}}. $ (2.10)
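A row-wise softmax can be sketched as follows (plain Python; the shift by the row maximum is a standard numerical-stability trick, not something stated in the paper). By construction, every row satisfies the sum rule of Eq (2.2):

```python
import math

def softmax_rows(X):
    """Row-wise softmax of Eq (2.10), shifted by the row maximum for
    numerical stability (the shift cancels in the ratio)."""
    P = []
    for row in X:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        P.append([e / s for e in exps])
    return P

X = [[0.0, 1.0, 2.0], [-3.0, 0.0, 3.0]]
P = softmax_rows(X)
for row in P:
    assert abs(sum(row) - 1.0) < 1e-12          # Eq (2.2) holds by construction
    assert all(0.0 <= p <= 1.0 for p in row)    # valid probabilities
```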
This is a common implementation device in machine learning. To find a good cluster assignment configuration, the parameters were updated at every step, using automatic differentiation provided by the grad function of the Jax library in Python, according to the following update rule:
$ X_\text{new} = X_\text{old} - \text{StepSize}\, \nabla_X \text{Loss}(X). $ (2.11)
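The update rule can be illustrated on a toy objective (plain Python; we use a central finite-difference gradient as a stand-in for the automatic differentiation the paper performs with Jax's grad, and the quadratic loss is our own example, not the paper's loss):

```python
def gradient_descent(loss, X, step_size=0.1, n_steps=200, eps=1e-6):
    """Plain gradient descent on an unconstrained matrix X (Eq (2.11)),
    with a central finite-difference gradient standing in for automatic
    differentiation."""
    X = [row[:] for row in X]  # work on a copy
    for _ in range(n_steps):
        grad = [[0.0] * len(X[0]) for _ in X]
        for i in range(len(X)):
            for c in range(len(X[0])):
                X[i][c] += eps
                up = loss(X)
                X[i][c] -= 2 * eps
                down = loss(X)
                X[i][c] += eps  # restore
                grad[i][c] = (up - down) / (2 * eps)
        for i in range(len(X)):
            for c in range(len(X[0])):
                X[i][c] -= step_size * grad[i][c]  # Eq (2.11)
    return X

# Minimize a toy quadratic loss; the minimum is at X = 0.
toy_loss = lambda X: sum(x * x for row in X for x in row)
X_out = gradient_descent(toy_loss, [[1.0, -2.0], [3.0, 0.5]])
assert all(abs(x) < 1e-3 for row in X_out for x in row)
```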
We apply the clustering algorithm to a network made of 3102 counties and county equivalents, located within the 50 states and the District of Columbia (DC). In total, there exist 3144 counties and county equivalents within the USA, however 42 counties were missing from the Cuebiq dataset, and they are not included within our network. These missing counties will appear in white on the maps of the USA, as seen for instance in Figure 1.
The mobility data was provided by Cuebiq and contains the number of users of their proprietary app traveling from county $ i $ to county $ j $, normalized by the number of users seen in county $ i $. For each county, the data includes the 15 largest flows to other counties on a weekly timescale; hence, the entries are largely dominated by commuters traveling among nearby counties. Air travelers, while not explicitly excluded, are numerically fewer than commuters and ground-based travelers, and often do not make it above the cut or remain subleading. We consider the data from Cuebiq users as a good proxy for the total population of each county, as confirmed by the provider. Using this data, we constructed the 3102 by 3102 flow matrix used in Eq (2.4), where the matrix entries represent the flow from county $ i $ to county $ j $ normalized by the population of the county of origin [26].
In order to speed up the convergence of the algorithm, the variable matrix $ X $ was initialized using the physical proximity among counties. To do this, we generated $ N_{\text{clusters}} $ "initialization central points" (ICPs) spread across the USA, and then initialized the $ X $-values for the $ N_{\text{counties}} $ nodes proportionally to the distances from these points. The distances are computed between the geographical center of each county $ i $ and the ICPs. Specifically, we set:
$ X_{ic} = -0.1\, \text{distance}(\text{Center of County } i, \text{ICP } c), $ (2.12)
where $ 1\leq i \leq N_{\text{counties}} $ is the county index and $ 1\leq c\leq N_{\text{cluster}} $ is the cluster index.
In this way, at initialization, each node is assigned the highest probability of belonging to the cluster of the nearest ICP. In practice, the ICPs are defined as a rectangular grid of equally spaced points covering the USA (including Alaska and Hawaii); see Figure 1(a) for an example with $ N_\text{clusters} = 100 $. During the initialization process only a subset of the clusters is populated; the remaining ones are later populated by the algorithm. For instance, the clusterings in Figure 1(b), (c) were obtained for two different loss function combinations and after $ 50,000 $ gradient descent steps.
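The initialization of Eq (2.12) can be sketched as follows (plain Python; the plane coordinates are made up for illustration). The nearest ICP receives the least negative $X$-value and hence the highest initial membership probability after the softmax:

```python
import math

def init_X(county_centers, icps, scale=0.1):
    """Eq (2.12): initialize X_ic from the distance between county i's
    geographical center and initialization central point (ICP) c."""
    X = []
    for center in county_centers:
        X.append([-scale * math.dist(center, icp) for icp in icps])
    return X

# Two toy counties and two ICPs on a plane (coordinates are illustrative).
centers = [(0.0, 0.0), (10.0, 10.0)]
icps = [(1.0, 0.0), (9.0, 10.0)]
X = init_X(centers, icps)
# County 0 is nearest ICP 0; county 1 is nearest ICP 1.
assert max(range(2), key=X[0].__getitem__) == 0
assert max(range(2), key=X[1].__getitem__) == 1
```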
As can be seen in Figure 1(b), sizable values of $ \alpha_\text{Pop} $ force the clusters to have similar populations, at the cost of heterogeneity in their geographical extension and of discontinuities. This is due to the very heterogeneous distribution of the population, which features both densely populated counties and very sparsely populated ones. As a result, the clustering in Figure 1(b) contains clusters consisting of a handful of counties near cities, in contrast to very extended and discontinuous ones in more rural areas. To adjust the outcome, we reduced the impact of the population requirement by choosing $ \alpha_\text{Pop} = 0.01 $ and $ \alpha_\text{Int} = 0.99 $. This resulted in the clustering in Figure 1(c), with more comparable and geographically continuous clusters. We then tested the algorithm with various numbers of clusters, obtaining comparable results.
In the remainder of this work we will focus on $ N_\text{clusters} = 49 $, which provides a number of clusters closely comparable to the number of states (50 + DC). We obtained the 49 clusters with pre-pandemic mobility levels, using data from the first week of January 2020. As a working point, we used skewed weights $ \alpha_\text{Int} = 0.99 $ and $ \alpha_\text{Pop} = 0.01 $ to minimize cluster-to-cluster mobility flows. The obtained clusters are visualized in Figure 2(a). To test the residual level of inter-cluster mobility, we computed the following matrix at the end of the algorithm run
$ M_\text{inter-cluster} = X_\text{out}^T\, \text{Interaction}\, X_\text{out}, $ (2.13)
where $ X_\text{out} $ is the final value of the variable matrix $ X $ used to define the cluster probabilities via Eq (2.10). The entries of this matrix are visualized in Figure 2(b).
The vast majority of the activity is detected on the diagonal, which represents the mobility among counties belonging to the same cluster. Off-diagonal entries, instead, feature very small values, indicating very limited mobility among counties belonging to different clusters. This result validates the effectiveness of the clustering algorithm.
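A toy version of this diagnostic (plain Python; the example network is our own, and for clarity we contract the interaction matrix with the probability matrix rather than with $X_\text{out}$) illustrates why diagonal dominance signals a good clustering:

```python
def inter_cluster_matrix(P, interaction):
    """Aggregate county-to-county flows into cluster-to-cluster flows,
    in the spirit of Eq (2.13): M = P^T Interaction P."""
    n, k = len(P), len(P[0])
    M = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            M[a][b] = sum(P[i][a] * interaction[i][j] * P[j][b]
                          for i in range(n) for j in range(n))
    return M

# Toy network: nodes 0 and 1 interact strongly, node 2 is nearly isolated.
interaction = [[0.0, 4.0, 0.1],
               [4.0, 0.0, 0.1],
               [0.1, 0.1, 0.0]]
P = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignment for clarity
M = inter_cluster_matrix(P, interaction)
# Within-cluster (diagonal) flow dominates the off-diagonal entries.
assert M[0][0] > M[0][1] and M[0][0] > M[1][0]
```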
In this section we review the results obtained from our model, exploring the stable clusters that were extracted from the mobility data of the first 6 months of 2020. Using these "core clusters", we were able to show the temporal spread of COVID-19 from its origin to the entire USA in the first few weeks of the pandemic. Our main observations are:
● The initial spread of the disease is due to case importation via flights: it first appears on the West Coast and is then clearly mapped to spread via flights from some of the largest flight hubs in the USA.
● In comparing the geographic cluster structure obtained from our algorithm with the USA state (census-based) structure, we highlight regions that had a disproportionately high number of COVID-19 cases. Some of these regions are large portions of a given state, while others encompass two or more adjacent states. This reveals that epidemiological data studied at the state level is not always illuminating and can be late in signaling rising cases, since portions of a state may already be experiencing high transmission. Moreover, highlighting disease spread based on how people interact with their communities indicates that allocating public health resources to those areas could have a beneficial impact on containing future communicable disease outbreaks.
● Last but not least, we note the ability of this model to be applied to other regions beyond the USA using different (mobility, population, disease) data: as an example, we propose an application to Ontario, Canada, where data on commuters and Google mobility data are publicly available.
Over the course of a pandemic, people's mobility continually changes as different regions impose various non-pharmaceutical measures in order to counter the spread of the disease. During the early phases of COVID-19, some states/regions were placed in lockdown, while others had lighter restrictions. In principle, such changes could affect how counties are clustered by our algorithm via changes in the interaction matrix.
To test the stability of the algorithm results, we considered the first six months of 2020, which saw the first diffusion of the infections and the most severe travel and local mobility restrictions. During this period the majority of restrictions put in place by the USA government were to limit mobility [27]. Hence, we constructed six different clusterings using the mobility data from the first week of each month from January to June, 2020. We then use the output probability matrices $ P_\alpha $, where $ \alpha $ labels the month, to check the stability of the output.
To do so, we first define a product matrix $ M_\alpha $ which traces over the clusters:
$ M_\alpha = P_\alpha P_\alpha^T. $ (3.1)
Each entry $ M_{ij} = \sum_{c = 1}^{N_\text{clusters}} P_{ic} P_{jc} $ measures how likely it is that county $ i $ belongs to the same cluster as county $ j $, as the product of the two probability rows is maximized when they coincide. To make this similarity measure comparable across counties, we normalize the entries of the matrix in Eq (3.1) as follows:
$ \bar{M}_{ij} = \frac{M_{ij}}{\sqrt{\sum_{j'=1}^{N_\text{nodes}} M_{ij'}^2}}, \quad \text{for } i \in [1, N_\text{nodes}]. $ (3.2)
The similarity measures are expressed by the diagonal entries of the product matrix
$ S_{\alpha\beta} = \bar{M}_\alpha \bar{M}_\beta^T. $ (3.3)
In fact, for $ \alpha = \beta $, the diagonal entries are all equal to 1 thanks to Eq (3.2). For $ \alpha \neq \beta $, closeness to unity for the diagonal entries measures how similar the two clusterings $ \alpha $ and $ \beta $ are to each other. We then construct the similarity matrix $ S $ for clusterings stemming from consecutive months from January to June, 2020. The distribution of the diagonal entries for the five cases are shown in Figure 3.
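The similarity construction of Eqs (3.1)–(3.3) can be sketched as follows (plain Python; the toy probability matrix is our own). Identical clusterings yield diagonal entries exactly equal to 1, as stated above:

```python
import math

def row_normalized_product(P):
    """M = P P^T (Eq (3.1)) with rows normalized as in Eq (3.2)."""
    n, k = len(P), len(P[0])
    M = [[sum(P[i][c] * P[j][c] for c in range(k)) for j in range(n)]
         for i in range(n)]
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

def similarity_diagonal(P_a, P_b):
    """Diagonal of S = M̄_a M̄_b^T (Eq (3.3)): one stability score per county."""
    Ma, Mb = row_normalized_product(P_a), row_normalized_product(P_b)
    n = len(Ma)
    return [sum(Ma[i][j] * Mb[i][j] for j in range(n)) for i in range(n)]

# Comparing a clustering with itself scores exactly 1 for every county.
P = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15]]
assert all(abs(s - 1.0) < 1e-12 for s in similarity_diagonal(P, P))
```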
These results show that the majority of counties remain in the same cluster over the 6-month period, as the majority of the entries remain very close to 1. This allows us to distinguish counties that consistently belong to the same cluster from counties that flip between different clusters (at least once). The pruning of unstable counties can be performed by keeping only counties whose diagonal $ S $-entries remain above a given threshold over the 6-month period under study. We show in Figure 4 the results for three values of the threshold: $ 0.1 $, $ 0.2 $ and $ 0.5 $. The maps clearly show that inconsistent counties tend to be located within less populated areas, while densely populated areas remain stable.
As an interesting highlight, pruning with a threshold of 0.5 agrees quite well with the rural-urban map of the USA, which we include in Figure 5(b), as presented in [28]:
We find a good compromise in defining stable clusters after pruning counties with a threshold of $ 0.2 $, as shown in Figure 4(b), because we are not looking to exclude all rural counties from our analysis. This pruning leaves 43 core clusters to be used for further analyses in the following section (6 clusters being emptied). We can also check the effect of the population constraint we imposed in the clustering algorithm. In Figure 6 we compare the distribution of population in the 50 + DC states and in the 43 core clusters remaining after the above pruning. One can see that several core clusters have a population of around 8 million, while state populations are concentrated around smaller values, with a few very populous exceptions. This shows that a small weight such as $ \alpha_\text{Pop} = 0.01 $ is sufficient to obtain clusters with fairly balanced populations.
COVID-19 incidence data are available at the county level in the USA. However, the small populations of many counties produce statistically poor data, which do not allow one to clearly identify growth patterns. Hence, it is important to aggregate the data into larger geographical units. By applying the clustering algorithm, we aimed at identifying well-mixed regional sub-populations. Having achieved our goal mathematically, we now focus our attention on the analysis of the epidemiological data in the newly formed sub-populations, i.e., the core clusters. Compared to a state-level analysis, our approach permits a new view of the initial COVID-19 spread, indirectly highlighting the importance of flights and case importations. As a consequence, our results call for more localized preventive measures in large states (such as California), and for coordination/cooperation among neighboring states in staving off disease spread.
The algorithmic definition of core clusters also provides a new view of the initial geographical spread of COVID-19 cases within the USA. In particular, it allows us to see how the disease spreads geographically to the most populated areas, which are also served by airport hubs; these have been shown to play a crucial role in the diffusion of airborne diseases in many regions of the world (see for instance [15,29]). To illustrate this, we plot a timeline of the initial spread of COVID-19 from February 24 until March 16, 2020, hence over a 4-week period. In each geographical unit, we identify the time of transition from a disease-free state to the initial exponential increase in the incidence numbers. In Figure 7 we show the progression of the disease within core clusters (left panels) and states (right panels): each geographical unit is colored in red in the week when the exponential increase is first detected, and turns green from the week after.
The difference between the two columns is telling. In the core cluster analysis, we see that the disease started in two clusters located in northern California and western Nevada. During the second week, it spread to nearby clusters in the west (Oregon, the Seattle area, and Iowa) as well as to major airport hubs in the central and eastern parts of the USA. We can easily identify isolated red clusters around Boston, New York, Washington, Chicago, Atlanta, Houston and Los Angeles in Figure 7(a2). During the third week, the disease reaches the remainder of the territory, except for areas in the states of Washington and Montana and in Alaska, which turn red during the fourth week. The corresponding state-level analysis, shown in the right panels, features a similar overall pattern; however, important details are missing or diluted. In particular, the importance of airports, which is indirectly highlighted in the core cluster analysis, is missing. The core cluster analysis instead confirms the results obtained, for instance, in [15], where evidence was collected, from data aggregated at the census division level, that airborne traffic was the principal culprit for the spread of COVID-19 from California to the rest of the country.
Furthermore, the clustering algorithm reveals specific features of the disease spread in local areas within some large states, and also in areas spanning across states. For instance, one can see that the disease starts effectively spreading in Texas near Houston and at the southern tip during the second week, while at the state level one would conclude that Texas was one of the first affected states. These specific differences could inform policy and give decision makers and public health officials more tools to act on preventive measures very early on, rather than being guided by state-level views.
For visual confirmation of the importance of case importation by air, we show in Figure 8 the panel of Figure 7(a2) side by side with the enplanement map of the top 50 airports in the USA, courtesy of the USA Bureau of Transportation Statistics [30]. In the upper panel we see the red spots corresponding to airline hubs, while in the panel below we see the flight density (represented by the size of each bubble) around the top 50 airports in the USA.
In this section we highlight the fact that examining transmission quantifiers in the core cluster geography versus the census-based geography can be revealing. For instance, we make the case below that state-level incidence reporting may be late in signaling state-wide transmission when it overlooks hot transmission zones within a state arising from its adjacency with a different state. Moreover, core clusters overlapping two or more states identify high-transmission areas that may need coordinated policy and resource allocation.
Starting from incidence data provided by the New York Times at the county level, we aggregated the data both at state and core cluster level. Our aim is to study the growth of the number of infections and compare different regions. We assumed that, at the beginning of a wave of infections, the incidence number $ \text{inc}(t) $ is given by an exponential curve of the type:
$ \text{inc}(t) = \text{inc}(0)\, e^{\rho t}. $ (3.4)
This behavior would signal the phase of exponential growth in the infections and the start of an epidemiological wave. Hence, we can compute a time-series of the exponential growth factor:
$ \rho = \ln\frac{\text{inc}(t+1)}{\text{inc}(t)}, \quad \text{with } \text{inc}(t) \neq 0. $ (3.5)
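Equation (3.5) can be applied directly to an incidence vector. The following minimal Python sketch (with synthetic incidence numbers, not the New York Times data) computes the growth-factor time series:

```python
import math

def growth_factors(inc):
    """rho(t) = ln(inc(t+1)/inc(t)), defined where the incidence is nonzero."""
    rho = []
    for t in range(len(inc) - 1):
        if inc[t] > 0 and inc[t + 1] > 0:
            rho.append(math.log(inc[t + 1] / inc[t]))
        else:
            rho.append(float("nan"))  # undefined when inc(t) = 0
    return rho

# Pure exponential growth inc(t) = inc(0) e^{0.2 t} yields a constant rho = 0.2
inc = [100 * math.exp(0.2 * t) for t in range(8)]
rho = growth_factors(inc)
```

During a genuine exponential phase the series is flat, which is exactly the signature used to detect the start of a wave.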
The growth factor $ \rho $ is computed from the initial phase of nearly exponential growth in the neighborhood of the disease-free equilibrium, corresponding to a phase of linear growth in $ \log(\text{inc}) $ with slope $ \rho $. The growth factor estimates feed into a closed-form formula for the initial reproduction number $ R_0 $ based on $ \rho $ (see [31] for details*). We identify the initial, fastest phase of nearly unchecked growth in any given region with the help of a piecewise linear fit to the log of the incidence. We use the $\texttt{R}$ function $\texttt{dpseg()}$ from the $\texttt{dpseg}$ package [32]. This function uses a dynamic programming algorithm to generate an optimal piecewise linear fit to a time series, balancing goodness of fit against an (adjustable) penalty for each additional segment. We then identify the earliest segment with the steepest positive slope (largest $ \rho $) as the initial near-unchecked exponential growth phase. We introduced this computation in our earlier paper [33], where we deduce the exponential growth regime for the USA in early 2020, as shown in Figure 9.
*In a typical SEIR model, where $ \sigma $ is the susceptible to exposed rate and $ \gamma $ is the recovery rate, then $ R_0 = \frac{(\rho+\sigma)(\rho+\gamma)}{\sigma\gamma} $, where $ \rho $ is the exponential growth factor.
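As an illustration of the two steps above, the sketch below (in Python rather than the paper's R, using a fixed-width window search as a crude stand-in for $\texttt{dpseg()}$'s optimal segmentation, and with assumed SEIR rates $\sigma = 1/5.2$ and $\gamma = 1/10$ that are not fitted values from the paper) extracts the steepest early slope of the log-incidence and converts it into $R_0$ via the footnote formula:

```python
import math

def window_slope(logy, start, width):
    """Least-squares slope of logy[start:start+width] against the time index."""
    xs = range(width)
    ys = logy[start:start + width]
    xbar = (width - 1) / 2
    ybar = sum(ys) / width
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

def steepest_early_slope(inc, width=5):
    """Largest slope over fixed-width windows of log-incidence: a crude
    stand-in for dpseg()'s optimal piecewise-linear segmentation."""
    logy = [math.log(v) for v in inc]
    return max(window_slope(logy, s, width)
               for s in range(len(logy) - width + 1))

def r0_from_growth(rho, sigma, gamma):
    """Footnote formula: R0 = (rho + sigma)(rho + gamma) / (sigma * gamma)."""
    return (rho + sigma) * (rho + gamma) / (sigma * gamma)

# Synthetic incidence: slow growth (rho = 0.05) then fast growth (rho = 0.3)
inc = [10 * math.exp(0.05 * t) for t in range(10)]
inc += [inc[9] * math.exp(0.3 * (t + 1)) for t in range(10)]

rho_hat = steepest_early_slope(inc)            # recovers the fast slope 0.3
r0 = r0_from_growth(rho_hat, 1 / 5.2, 1 / 10)  # assumed sigma, gamma
```

Note that $\rho = 0$ gives $R_0 = 1$ in the footnote formula, as expected at the epidemic threshold.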
We visualize the values of $ R_0 $ obtained at the state level in Figure 10(a) and by use of the core clusters in Figure 10(b). Here, the initial reproduction number $ R_0 $ is computed over weeks 9 to 14 of 2020, that is, the period from February 24th to March 16th, 2020. The results are fairly compatible; however, a visual comparison of the two maps shows that some counties are characterized by very different values.
To better visualize the difference between the two approaches - state vs. core clusters - for each county we computed the difference in the local $ R_0 $ obtained by the two different aggregations:
$ \Delta R_0 = R_0(\text{core cluster}) - R_0(\text{state}), $ (3.6)
and show the values in Figure 10(c), where the color gradient corresponds to the size of the difference. Areas in yellow highlight the biggest differences, meaning that the $ R_0 $ values in those clusters were higher than the $ R_0 $ indicated by state-level case data; the darker blue areas indicate that the state-level data gave a higher value of $ R_0 $ than the clustering data. From a policy perspective, the yellow areas are the important ones: their presence means that, looking strictly at the geographic state level, policymakers may feel optimistic about their state-wide $ R_0 $ values when, in effect, due to people's mobility, the initial force of infection is much higher in many of their counties (see Figure 10(c)). The core cluster analysis therefore provides more reliable results for local communities living in a fraction of a state's counties, where localized measures could be implemented to limit the incidence of the disease.
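Per county, Eq (3.6) is a simple subtraction. A toy Python sketch, with hypothetical county names and $R_0$ values used purely for illustration, flags the "yellow" counties where the state-level estimate understates local transmission:

```python
# Hypothetical per-county R0 estimates under the two aggregations
# (county names and values are illustrative, not from the paper)
r0_state   = {"county_A": 1.8, "county_B": 1.8, "county_C": 2.4}
r0_cluster = {"county_A": 2.6, "county_B": 1.5, "county_C": 2.4}

delta_r0 = {c: r0_cluster[c] - r0_state[c] for c in r0_state}

# "Yellow" counties: the core cluster R0 exceeds the state-level estimate,
# so the state-wide figure understates local transmission
yellow = [c for c, d in delta_r0.items() if d > 0]
```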
In order to capture the state-level versus the core cluster based view of the population and the concurrent disease spread, we extracted a few sample states with their corresponding clusters. We observed three possible situations: ⅰ) a state is essentially its own cluster, without interactions with other clusters (e.g., Alaska and Maine); ⅱ) a state contains several clusters (e.g., California and Florida); and ⅲ) a cluster overlaps more than one state's land mass. Case ⅰ) is a trivial equivalence between the two methods, hence in this section we focus the analysis on examples of Cases ⅱ) and ⅲ).
California is a clear example of Case ⅱ), see Figure 11(b), as the majority of its population and territory is contained within four core clusters: 41, 42, 39 and 18 (note that 18 also includes one county from Arizona). In Figure 11(a) we show the cases per capita, per day, smoothed via a 7-day rolling average, for California (in black) as compared to its four core clusters. Notably, not all the state's cases are included, as some counties are removed from core clusters or are contained in out-of-state clusters. Thanks to the core cluster analysis, we can identify areas within California where disease activity seems higher than elsewhere. For instance, clusters 18 and 42 had higher than state-average cases per capita, thus it stands to reason to localize non-pharmaceutical interventions or preventive treatments in those zones. In Table 1 we list the counties comprised within clusters 18 and 42 and their populations. Cluster 42 corresponds to the urban area of Los Angeles, and it saw an early rise of cases compared to the counties in nearby cluster 18. Instead, cluster 18 shows a larger incidence number at later stages in the pandemic: this makes sense, as the LA area has a much higher population density (so rapid spread is very likely), and it was first detected as case-positive by our analysis in Figure 7 very early, in the week of March 1st, 2020. Cluster 18 has a population density about one tenth that of the LA area, and the initial spread there happened a week later. It contains 3 Californian congressional districts (San Bernardino, Riverside and Imperial), with the first two known for alternating between Republican and Democratic political representation. Yuma County, in Arizona, is similarly Republican in presidential voting, with some Democratic local representatives.
Based on known correlations between willingness to adopt NPI measures and the political leaning of USA individuals (see [34]), our analysis speculatively points to the same argument: the higher-than-average incidence in cluster 18, though much more sparsely populated than cluster 42, may be due to individual behaviour, i.e., a higher presence of individuals disinclined to adopt NPI measures.
Cluster 18 counties | Population | Cluster 42 counties | Population
Yuma County | 207,829 | Los Angeles County | 10,098,052
Imperial County | 180,216 | Ventura County | 848,112
Riverside County | 2,383,286 | |
San Bernardino County | 2,135,413 | |
Population density | 132.77/sq mi | Population density | 1854.96/sq mi
We performed a similar analysis for the state of Florida, which comprises two enclosed clusters, as shown in Figure 12 (while counties in the north are joined with the neighboring states). From Figure 12(a), we see that cluster 31 is disproportionately responsible for the spread during the three waves that Florida experienced in 2020, as compared to cluster 2. This effect could be explained by the higher population density in cluster 31 (780 people per square mile), as it also encloses Miami; cluster 2, instead, has an average density of 318 people per square mile. Notably, cluster 31 contains the counties of Monroe, Miami-Dade and Broward, three of the most tourism-intensive counties with some of the nicest weather, providing ample opportunities for individuals to lower their risk perception of getting infected with COVID-19 [35].
To illustrate Case ⅲ) in this section, we took the example of the state of New Mexico: it is almost entirely covered by the much larger core cluster 38, which also includes a sizable part of Texas and a few counties in Colorado, as shown in Figure 13(b). In analogy with the analysis of Case ⅱ) before, in Figure 13(a) we show the cases per capita in the parts of cluster 38 that overlap with the states of New Mexico and Texas, versus the cases per capita in the whole cluster (in green). Interestingly, we see rather different behavior in the two state portions of the cluster, due to the very different policies applied in the two states. Nevertheless, the interconnection among counties within the cluster, highlighted by our clustering algorithm, implies that disease could propagate from one side to the other far more easily than could be expected by simply looking at the state boundaries. Here the analysis implies that coordination and collaboration on non-pharmaceutical interventions and preventive policies against the spread of disease would be beneficial to New Mexico, and it would help the state of Texas as well. Since Texas had very little control over its disease spread during the pandemic, its policies directly and negatively influenced New Mexico.
Our algorithm is built generically: all that is required is a region that can be subdivided into $ N $ sub-regions (such as counties or public health regions) and mobility data among those sub-regions. To illustrate its versatility, we applied it to other regions of the world, reporting here the case of the province of Ontario in Canada. For the mobility information, in the case of Ontario, we switched from proprietary data to publicly available data: we used a 2016 work mobility survey as the baseline for worker mobility and then adjusted it with the Google mobility index changes from baseline to obtain an interaction matrix. Moreover, we treated Ontario as a collection of 34 public health regions, which are well defined geographically.
To obtain a daily contact rate between health units in the province of Ontario across 23 months (from February 2020 to December 2021), we combined data from two publicly available sources: mobility reports by Google [20] and commuting flow data by the Government of Canada [21]. The commuter flow data reveal that 25.16% of the employed population in Ontario works outside the census division where they live, and that 11 census divisions out of 49 have more than 40% of their workforce commuting daily outside the census division where they reside [21]. By combining this data set with the Google mobility reports, we can estimate the daily contact rates among the network nodes in Ontario.
During the COVID-19 pandemic, Google mobility reports captured changes in movement over time compared to baseline (pre-lockdown) activity in different categories, such as retail/recreation, transit stations, and workplaces [20]. Google's index data have been used in previous analyses [13,36,37]. Since Google split the mobility data into 51 regions in Ontario, corresponding to local municipalities, we combined several regions along their borders to obtain mobility data for the 34 health units used in this work (see [38] for more details). The same borders were used to calculate the commuter rate for health units, as commuter data was reported at the census division level. Let $ m_{ij} $ denote the entries of the commuter flow matrix between health units in Ontario, where $ i $ is the place of residence (POR) and $ j $ is the place of work (POW). The contact rate between health unit $ i $ (POR) and health unit $ j $ (POW) on day $ t $ is then calculated as:
$ \text{contact}_{ij}(t) = \frac{m_{ij}}{\text{Emp}_i}\, g^m_i(t), \quad \text{where } m = \text{Work}, $
where $ \text{Emp}_{i} $ is the employed population size of health unit $ i $ and $ g^m_i(t) $ represents the percentage fluctuation of the Google Work index on day $ t $ compared to the baseline Google index.
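A minimal Python sketch of this contact-rate computation, using hypothetical commuter numbers and expressing the Google Work index as a fraction of baseline (an assumption about units; the paper reports percentage fluctuations):

```python
def contact_rate(m, emp, g_work, i, j):
    """contact_ij = (m_ij / Emp_i) * g_i^Work(t): the commuting share from
    health unit i to j, scaled by the day's workplace-mobility level."""
    return (m[i][j] / emp[i]) * g_work[i]

# Hypothetical 2-unit example: 30,000 of unit 0's 100,000 employed residents
# work in unit 1; workplace mobility sits at 60% of baseline that day
m = [[70_000, 30_000], [5_000, 45_000]]
emp = [100_000, 50_000]
g_work = [0.6, 0.6]
c01 = contact_rate(m, emp, g_work, 0, 1)
```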
The result of the clustering algorithm is shown in Figure 14, where we particularly focus on the southern region of Ontario, where most of the population is concentrated. This part is subdivided into 6 clusters (while two more are found in the northern part). Interestingly, the cluster structure roughly coincides with the administrative division of health units into 5 regions: East, Central East, Central West, South West and Toronto [39]. While East matches the definition of cluster 1 (blue), which comprises Ottawa, the other regions are rearranged by our algorithm. Notably, Toronto is merged with Central East and part of Central West, following the intense commuter flow into the province's main city from nearby regions. The other clusters clearly center on Niagara (8), Waterloo (4) and Windsor (5). Hence, our cluster definition seems to better represent the demographic character of the province of Ontario than the administrative division does.
In this paper we formulated and studied a novel clustering algorithm based on mobility data among small geographical units, such as counties. We applied it to the counties of the USA across the first 6 months of the 2020 COVID-19 pandemic, and we introduced the notion of core clusters. These are defined by counties that consistently remain in the same cluster over a long period, hence being insensitive to changes in mobility due to the implementation of local non-pharmaceutical measures. The core clusters provide comparable geographical units, each characterized by a sub-population that is well mixed by internal mobility. Hence, they offer an ideal basis for studying the diffusion of an infectious disease within a large region. This new approach allowed us to capture the spatio-temporal spread of COVID-19 across the USA and to highlight features that are washed out in a state-level analysis. For instance, we found that the initial spread of COVID-19 in 2020 proceeded from northern California to a series of hot zones that each feature an airport hub. This result further confirms the relevance of airborne passenger transportation, as first highlighted in an earlier study using the nine census divisions as geographical units [15]. Furthermore, an analysis of the incidence number in core clusters showed that some states would need a higher granularity of localized measures to dampen disease spread, while others showed the need for cooperation on dampening measures between neighboring states. The core clusters we identified in the USA could also be used as an efficient basis to predict the spread of an infectious disease across the country by means of a diffusion model, like for instance the eRG [40].
The clustering of Ontario is a simple example of the capabilities of our algorithm. Given any geographical area, its population distribution, and a form of mobility data, the algorithm can cluster virtually any region to discover well-connected sub-regions. This paper focuses on the usefulness of the algorithm for the COVID-19 pandemic; however, no aspect of the algorithm is limited to it, so the algorithm can be applied to a variety of problems that rely on population mobility. Because the algorithm is modular, a user can add additional terms to the loss function, tailoring it to their specific problem.
From a policy perspective, the versatility of applying our algorithm at any granularity level in a given large geographic area is of interest to governing bodies concerned about the differing sub-populations living and working there and about their connections with their neighbors. In the last section we pointed out specific regions with a disproportionate number of cases per capita relative to other regions within the same geographical/census area. Areas that experienced higher-than-state case numbers per capita during the pandemic can be extrapolated to have a higher probability of transmission of a pathogen such as SARS-CoV-2, due to their internal well-mixing. Thus, they can be a target of resource allocation in order to reduce the impacts of a future pandemic/epidemic: for instance, identifying regions where two or more states could coordinate their public policy (as is the case with New Mexico and Texas) can also translate into a resource allocation marker, such as deployment of PPE/medical equipment, lab and test supplies, data analysis expertise and capacity, and later vaccine supplies. Equally, two adjacent states with well-mixing interstate populations could deploy resources and preventive measures in a more coordinated fashion. Last but not least, we exemplify here several uses of our study; each aspect of our analyses can be developed more in-depth, depending on user needs.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
All sources of funding for the study are disclosed below. M.-G. Cojocaru acknowledges support for the work from the National Sciences and Engineering Research Council (NSERC), via a Discovery Grant 400684 and an Alliance Grant Option Ⅱ (providing partial funding support for D. Lyver), and a Mathematics for Public Health Fields Institute grant (providing support for Z. Mohammadi). G. Cacciapaglia and C. Cot acknowledge partial support from the MITI project "Événements rares" of CNRS, project SpikeRG. Cuebiq mobility data was purchased and made available to the authors by Sanofi.
The authors declare there is no conflict of interest.
To assess how well the algorithm clusters, we use a variation on the stochastic block model (SBM). The SBM provides an ideal basis for testing the accuracy of clustering algorithms [35]. An SBM forms a graph under the assumption that each node in a network belongs to a community, connecting to nodes within its own community with probability $ p $ and to nodes outside it with probability $ q $ [35]. Once this graph is formed, the goal is to recover the communities from the connections between nodes, a process called community detection [35]. Several types of recovery are possible depending on the values of $ p $ and $ q $: exact recovery, partial recovery, and no recovery [35].
In order to achieve exact recovery using an SBM, the following inequality must hold [35]:
$ (n(p-q))^2 > 2(n(p+q)), $ (A1)
where $ n $ is the number of nodes in the graph, and $ p $ and $ q $ are as defined previously.
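Condition (A1) is easy to evaluate programmatically; a small Python helper, with illustrative values of $p$ and $q$:

```python
def exact_recovery_possible(n, p, q):
    """Sufficient condition (A1) for exact recovery in the SBM:
    (n(p - q))^2 > 2(n(p + q))."""
    return (n * (p - q)) ** 2 > 2 * (n * (p + q))

# At n = 500 with p + q = 1, recovery needs roughly p - q > 0.063
ok = exact_recovery_possible(500, 0.7, 0.3)          # gap 0.4: recoverable
marginal = exact_recovery_possible(500, 0.52, 0.48)  # gap 0.04: too small
```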
In order to use the SBM to test the algorithm, an example model was created so that the algorithm could be rigorously tested for performance. Creating this test graph requires defining three variables: $ p $, $ q $, and $ n $. Their definitions remain as given above, but their values change throughout testing in order to better understand the algorithm's capabilities [35].
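The paper does not list its generator code; the sketch below shows one standard way such a test graph could be sampled in Python:

```python
import random

def sample_sbm(n, k, p, q, seed=0):
    """Sample an SBM: n nodes in k equal communities; an edge joins two
    nodes with probability p (same community) or q (different ones)."""
    rng = random.Random(seed)
    community = [v % k for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if community[u] == community[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return community, edges

# 60 nodes, 3 planted communities, dense inside (p=0.8), sparse across (q=0.1)
community, edges = sample_sbm(60, 3, 0.8, 0.1, seed=1)
```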
This investigation helps establish optimal ranges for the weights of the loss function in Eq (2.3). To determine the optimal values of $ \alpha_{mobility} $ and $ \alpha_{Pop} $, a grid search was performed; heat maps of its results visualize the performance of the algorithm.
The performance in this case is defined by how close $ q $ can be to $ p $ such that the algorithm still performs exact recovery. The cluster number was arbitrarily chosen to be 5 and the number of nodes in the graph 500; however, a range of values was analyzed in order to understand the impact of manipulating each value. The initial test looked over all possible $ \alpha_{mobility} $ and $ \alpha_{Pop} $ with $ \alpha_{Pop}+\alpha_{mobility} = 1 $, in steps of $ 0.1 $. Each combination was tested on a set of $ p $ and $ q $ values such that $ p + q = 1 $ and $ p > q $. The initial test is visualized in Figure A1.
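The $\alpha$ grid-search loop can be sketched as follows (Python; the score function here is a toy stand-in for the exact-recovery test, not the paper's code):

```python
def grid_search(score, step=0.1):
    """Scan alpha_mobility over [0, 1] with alpha_Pop = 1 - alpha_mobility,
    returning the weight pair that maximises a caller-supplied score."""
    best_score, best_pair = float("-inf"), None
    k = 0
    while True:
        a_mob = round(k * step, 10)
        if a_mob > 1:
            break
        a_pop = round(1 - a_mob, 10)
        s = score(a_mob, a_pop)
        if s > best_score:
            best_score, best_pair = s, (a_mob, a_pop)
        k += 1
    return best_pair, best_score

# Toy score peaking at alpha_mobility = 0.9, standing in for "how close q
# can get to p while the communities are still recovered exactly"
pair, s = grid_search(lambda a_mob, a_pop: -(a_mob - 0.9) ** 2)
```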
In Figure A1(a), as the $ \alpha_{mobility} $ value increases, $ q $ and $ p $ can come closer together before the algorithm becomes unable to exactly extract the communities. Continuing this process while reducing the step size allowed for a "zoomed-in" grid search over the region of best-fitting parameters. In addition, the step size between the $ p $ and $ q $ values needed to shrink in order to find the boundary where the algorithm was unable to reconstruct the clusters for certain combinations of $ \alpha $'s.
We continually saw an improvement of the boundary as $ \alpha_{mobility} $ approached $ 1 $. The heat map for the final test run, used to determine the optimal $ \alpha $ values, is shown in Figure A1(b).
This results in an optimal range of $ \alpha_{mobility} \in [0.97, 0.99] $ and $ \alpha_{Pop} \in [0.03, 0.01] $, which contains the range assumed to provide the most accurate clustering of the USA in our earlier testing.
Another interesting revelation from this testing was the capability of the algorithm to cluster past the bound of the stochastic block model (SBM). Since the number of nodes was 500 and $ p + q = 1 $, the bound can be written as:
$ (500(p-q))^2 > 2(500(1)). $ (A2)
Since $ n = 500 $ and $ p + q = 1 $, rearranging and solving for $ p $ gives the bound $ p > 0.06 + q $, or $ p > 0.53 $. However, to obtain the results in Figure 5, the value of $ p $ used was $ 0.501 $. This was initially a cause for concern; however, it is acceptable because of the additional information provided to the system by the population term, which is enough to allow the algorithm to detect communities past the theoretical bound. Since the number of nodes and the number of clusters were both chosen arbitrarily, we examined the effect of altering these values on the algorithm's ability to extract the exact communities.
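The rearrangement can be checked numerically; with $p + q = 1$, condition (A2) reduces to $p - q > \sqrt{2n}/n$:

```python
import math

n = 500
# (n(p - q))^2 > 2n(p + q) with p + q = 1 reduces to p - q > sqrt(2n)/n
gap = math.sqrt(2 * n) / n   # ~0.063, i.e. p > 0.06 + q
p_min = 0.5 + gap / 2        # ~0.532, i.e. p > 0.53

# The USA clustering of Figure 5 used p = 0.501, i.e. below the SBM bound
past_bound = 0.501 < p_min
```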
First, we investigated the effect of changing the number of clusters; initially, we expected minimal effects from adjusting the number of communities. This is true for the optimal $ \alpha $ values, with the most accurate $ \alpha_{Pop} \in [0.05, 0.01] $ and $ \alpha_{mobility} \in [0.95, 0.99] $; however, the accuracy of clustering does degrade as the number of communities increases.
Next, the effect of changing the number of nodes while keeping the number of communities constant was analyzed, followed by the effect of both the number of communities and the number of nodes growing. This resulted once again in no change in terms of the optimal alpha values, as is demonstrated in Figure A2.
One may also note that, even for the optimal $ \alpha $ parameters, the closest that $ p $ and $ q $ can come while still achieving exact clustering is 0.7 and 0.3, respectively. A range of node values was analyzed up to $ n = 3500 $, all resulting in a similar trend. We stopped at $ n = 3500 $ because of the foreseen applications of this algorithm: our goal is to ensure that the algorithm can cluster the USA accurately, and the USA contains approximately 3100 counties.
We also analyzed the effect of increasing both the number of communities and the number of nodes, to mirror the problem being solved for the USA. However, this yielded no notable results.
The reason some of these experiments yielded no notable results is that increasing the number of communities and the number of nodes increases the complexity of the problem. To understand this increase, note that what is being tested is binary: whether a node has been placed in the correct community or not. This point of view allows the problem to be approached in a slightly different way. The number of connections between nodes in the same community grows according to:
$ p\binom{n/k}{2}. $ (A3)
Here $ n $ is the number of nodes, $ k $ the number of communities, and $ p $ the probability that a connection exists between two nodes in the same community. The number of connections between a node and nodes outside its own community grows according to:
$ q\,(n - n/k)\,n. $ (A4)
Here $ n $ and $ k $ are as in Eq (A3) and $ q $ is the probability that a connection exists between two nodes that are not in the same community.
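Reading Eq (A3) as the expected edge count inside one community of size $n/k$ (our interpretation of the formula), the scaling is easy to tabulate in Python:

```python
from math import comb

def expected_within(n, k, p):
    """Eq (A3): p * C(n/k, 2), edges expected inside one community."""
    return p * comb(n // k, 2)

def expected_between(n, k, q):
    """Eq (A4): q * (n - n/k) * n, growth of cross-community connections."""
    return q * (n - n // k) * n

# Doubling n and k keeps the community size (and within-community edges)
# fixed, while the cross-community term grows 4.5x: the problem gets harder
w1, b1 = expected_within(500, 5, 0.7), expected_between(500, 5, 0.3)
w2, b2 = expected_within(1000, 10, 0.7), expected_between(1000, 10, 0.3)
```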
Thus, as the number of nodes $ n $ and the number of communities $ k $ grow, there is a decrease in the performance of the algorithm. However, this decrease in performance is not seen in the clustering for the real-world application to the USA. The reason is the nature of the human interaction within our collected data, which introduces a geographical component into the algorithm: the flow metric being recorded is actual travel between regions. This travel depends on the geographical location of each node, as, for the most part, nodes that are very distant geographically have minimal to no travel between them. This in turn prevents a large increase in the complexity of the problem, thus preventing the algorithm from failing. Importantly, this is only a hypothesis, based on the properties of the real-world data versus the theoretical data generated from the SBM, and it will require further research. An approach will likely involve techniques similar to the SBM, but with a grid used to define distance within the network; using this distance, only nodes in a neighbourhood around one another would be connected. Ideally, such a distance metric will more closely capture the dynamics seen in real-world data.
Determining the accuracy of the model was done similarly to determining the best $ \alpha $ values. Using the fixed values $ \alpha_{Pop} = 0.01 $ and $ \alpha_{mobility} = 0.99 $, we ran the clustering algorithm on a set of $ p $ and $ q $ values to determine how well the algorithm clusters as it approaches the theoretical boundary. Based on the previous experiments to determine the optimal $ \alpha $ values, it was expected that for $ k = 5 $ and $ n = 1000 $ the algorithm would cluster the nodes perfectly even well past the theoretical boundary. This is demonstrated in Figure A3.
The blue line in Figure A3 represents the accuracy of clustering and shows that the algorithm is able to cluster the nodes into the correct communities past the theoretical boundary, which is represented by the orange horizontal line.
By increasing the number of nodes, the number of communities, or both, the problem becomes more difficult and a drop in accuracy is expected. By increasing the number of communities to $ k = 10 $, this effect is shown in Figure A4.
Figure A4 shows that, when clustering for a larger number of communities, the algorithm is no longer able to cluster past or even up to the theoretical boundary, as the problem has become too challenging. It is likely the accuracy of the algorithm would be affected less if a form of distance was introduced, as mentioned previously. Attempting to increase the number of nodes while keeping the number of communities constant had no notable effect on the accuracy of the algorithm, and thus it behaved very similarly to that of Figure A3.
County FIPS code | County name
2060 | Bristol Bay Borough |
2164 | Lake and Peninsula Borough |
2231 | Skagway-Yakutat-Angoon Census Area |
2232 | Skagway-Hoonah-Angoon Census Area |
2280 | Wrangell-Petersburg Census Area |
2282 | Yakutat Borough |
2290 | Yukon-Koyukuk Census Area |
6003 | Alpine County |
8005 | Arapahoe County |
12025 | Dade County |
13137 | Habersham County |
15005 | Kalawao County |
28117 | Prentiss County |
29137 | Monroe County |
30113 | Yellowstone National Park |
34027 | Morris County |
38045 | LaMoure County |
39139 | Richland County |
45045 | Georgetown County |
46113 | Shannon County |
48027 | Bell County |
48189 | Hale County |
51019 | Bedford County |
51081 | Greensville County |
51095 | James City County |
51515 | Bedford city |
51530 | Buena Vista city |
51540 | Charlottesville city |
51560 | Clifton Forge city |
51580 | Covington city |
51595 | Emporia city |
51600 | Fairfax city |
51660 | Harrisonburg city |
51678 | Lexington city |
51683 | Manassas city |
51685 | Manassas Park city |
51690 | Martinsville city |
51720 | Norton city |
51770 | Roanoke city |
51775 | Salem city |
51780 | South Boston city |
51790 | Staunton city |
51820 | Waynesboro city |
51840 | Winchester city |
[1] | Machu W (1978) Handbook of Electro painting Technology, Electrochemical Publication, Ltd, London. |
[2] | Sankara Narayanan TSN (2005) Surface pretreatment by phosphate conversion coatings—A review. Adv Mater Sci 9: 130–177. |
[3] | Saravanan G, Mohan S (2009) Corrosion behavior of Cr electrodeposited from Cr(VI) and Cr(III)-baths using direct (DCD) and pulse electrodeposition (PED) techniques. Corros Sci 51: 197–202. |
[4] | Pierce PE (1981) The Physical Chemistry of the Cathodic Electrodeposition Process. J Coat Technol 53: 52–67. |
[5] | Vereecken J, Goeminne G, Terryn H, et al. (Eds) (1997) in organic and inorganic coatings for corrosion prevention—Research and Experiences, EFC publication No.20, The institute of Materials, London, p. 103. |
[6] | Gominne G, Terryn H, Vereecken J (1995) Characterization of conversion layers on aluminium by means of electrochemical impedance spectroscopy. J Electrochem Acta 40: 479–486. |
[7] | Miskovic-Stankovic VB, Stanic MR, Drazic DM (1999) Corrosion protection of aluminium by a cataphoretic epoxy coating. Pro Org Coat 36: 53–63. |
[8] |
Miskovic-Stankovic VB (2002) The mechanism of cathodic electrodeposition of epoxy coatings and the corrosion behaviour of the electro deposited coatings. J Serb Chem Soc 67: 305–324. doi: 10.2298/JSC0205305M
![]() |
[9] | He Y, Zhang Y, Wu F, et al. Electrodeposition properties of modified cational epoxy resin-type photoresist, Institute of Photographic Chemistry, Chinese Academy of Sciences, 100101, Beijing, China. |
[10] |
Krylova I (2001) Painting by electrodeposition on the eve of the 21st century-Review. Pro Org Coat 42: 119–131. doi: 10.1016/S0300-9440(01)00146-1
![]() |
[11] | ASTM standards D1730-09 (2014) Preparation of aluminum and aluminum alloy surface for Painting. |
[12] | Miskovic-Stankovic VB, Drazic DM, Teo dorovic MJ (1995) Electrolyte Penetration Through Epoxycoatings Electrodeposited on Steel. Corros Sci 37: 241–252. |
[13] | Miskovic-Stankovic VB, Drazic DM, Kacarevic-Popovic Z (1996) The Sorption Characteristics Of Epoxy Coatings Electrodeposited On Steel During Exposure To Different Corrosive Agents. Corros Sci 38: 513–1523. |
[14] | Drazic DM, Miskovic-Stankovic VB (1990) The effect of resin concentration and electrodeposition bath temperature on the corrosion behaviour of polymer coated steel. Pro Org Coat 18: 253–264. |
[15] |
Lazarevic ZZ, Miskovic-Stankovic VB, Kacarevic-Popovic Z, et al. (2005) Determination of protective properties of electrodeposited organic epoxy coatings on aluminium and modified aluminum surfaces. Corr Sci 47: 823–834. doi: 10.1016/j.corsci.2004.07.016
![]() |
[16] | Drazic DM, Miskovic-Stankovic VB (1990) The determination of the corrosive behavior of polymer-coated steel with A.C. impedance measurements. Corros Sci 30: 575–582. |
[17] | Lazarevic ZZ, Miskovic-Stankovic VB, Kacarevic-Popovic Z, et al. (2005) The study of corrosion stability of Organic Epoxy Protective coatings on Aluminium and Modified Aluminium surfaces. J Braz Chem Soc V 16: 98–102. |
[18] |
Clark J, McCreery RL (2002) Inhibition of Corrosion-Related Reduction Processes via Chromium Monolayer Formation. J Electrochem Soc 149: B379–386. doi: 10.1149/1.1494825
![]() |
[19] | Shih T-S, Liu Z-B (2006) Thermally-Formed Oxide on Aluminum and Magnesium. Mater Trans 47: 1347–1353. doi: 10.2320/matertrans.47.1347 |
[20] | Miskovic-Stankovic VB, Lazarevic ZZ, Kacarevic-Popovic Z, et al. (2002) Corrosion behaviour of epoxy coatings on modified aluminium surfaces. Bull Electrochem 18: 343–348. |
[21] | Laget V, Jeffcoate CS, Isaacs HS, et al. (2003) Dehydration-Induced Loss of Corrosion Protection Properties in Chromate Conversion Coatings on Aluminum Alloy 2024-T3. J Electrochem Soc 150: B425–B432. doi: 10.1149/1.1593040 |
[22] | Campestrini P, Terryn H, Vereecken J, et al. (2004) Chromate Conversion Coating on Aluminum Alloys II: Effect of the Microstructure. J Electrochem Soc 151: B359–B369. doi: 10.1149/1.1736682 |
[23] | Notter IM, Gabe DR (1992) Porosity of Electrodeposited Coatings: its Cause, Nature, Effect and Management. Corros Rev 10: 217–280. |
1. | Mohammad Reza Abbaszadeh, Mehdi Jabbari Nooghabi, Mohammad Mahdi Rounaghi, Using Lyapunov’s method for analysing of chaotic behaviour on financial time series data: a case study on Tehran stock exchange, 2020, 2, 2689-3010, 297, 10.3934/NAR.2020017 | |
2. | Abderahman Rejeb, Karim Rejeb, John G. Keogh, Centralized vs. decentralized ledgers in the money supply process: a SWOT analysis, 2021, 5, 2573-0134, 40, 10.3934/QFE.2021003 | |
3. | Viviane Naimy, Omar Haddad, Gema Fernández-Avilés, Rim El Khoury, J E. Trinidad Segovia, The predictive capacity of GARCH-type models in measuring the volatility of crypto and world currencies, 2021, 16, 1932-6203, e0245904, 10.1371/journal.pone.0245904 | |
4. | Qiang Zhang, Wenliang Pan, Chengwei Li, Xueqin Wang, The conditional distance autocovariance function, 2021, 0319-5724, 10.1002/cjs.11610 | |
5. | Shuanglian Chen, Hao Dong, Dynamic Network Connectedness of Bitcoin Markets: Evidence from Realized Volatility, 2020, 8, 2296-424X, 10.3389/fphy.2020.582817 | |
6. | Chunliang Deng, Xingfa Zhang, Yuan Li, Qiang Xiong, Garch Model Test Using High-Frequency Data, 2020, 8, 2227-7390, 1922, 10.3390/math8111922 | |
7. | Zhenghui Li, Yan Wang, Zhehao Huang, Risk Connectedness Heterogeneity in the Cryptocurrency Markets, 2020, 8, 2296-424X, 10.3389/fphy.2020.00243 | |
8. | Xin Liang, Xingfa Zhang, Yuan Li, Chunliang Deng, Daily nonparametric ARCH(1) model estimation using intraday high frequency data, 2021, 6, 2473-6988, 3455, 10.3934/math.2021206 | |
9. | Pierre J. Venter, Eben Maré, GARCH Generated Volatility Indices of Bitcoin and CRIX, 2020, 13, 1911-8074, 121, 10.3390/jrfm13060121 | |
10. | Chunliang Deng, Xingfa Zhang, Yuan Li, Zefang Song, On the test of the volatility proxy model, 2020, 0361-0918, 1, 10.1080/03610918.2020.1836215 | |
11. | Hakan Pabuçcu, Serdar Ongan, Ayse Ongan, Forecasting the movements of Bitcoin prices: an application of machine learning algorithms, 2020, 4, 2573-0134, 679, 10.3934/QFE.2020031 | |
12. | Au Vo, Thomas A. Chapman, Yen-Sheng Lee, Examining Bitcoin and Economic Determinants: An Evolutionary Perspective, 2021, 0887-4417, 1, 10.1080/08874417.2020.1865851 | |
13. | Parul Bhatia, Preeti Bedi, Causal Linkages Among Cryptocurrency and Emerging Market Indices: An Empirical Investigation, 2022, 0972-2629, 097226292211056, 10.1177/09722629221105670 | |
14. | Amadeo Christopher, Kevin Deniswara, Bambang Leo Handoko, 2022, Forecasting Cryptocurrency Volatility Using GARCH and ARCH Model, 9781450396523, 121, 10.1145/3537693.3537712 | |
15. | Xiaoling Chen, Xingfa Zhang, Yuan Li, Qiang Xiong, Daily LGARCH model estimation using high frequency data, 2021, 1, 2769-2140, 165, 10.3934/DSFE.2021009 | |
16. | Răzvan Gabriel Hapau, 2022, Chapter 22, 978-3-031-09420-0, 387, 10.1007/978-3-031-09421-7_22 | |
17. | Stephen Zhang, Ganesh Mani, Popular cryptoassets (Bitcoin, Ethereum, and Dogecoin), Gold, and their relationships: volatility and correlation modeling, 2021, 4, 26667649, 30, 10.1016/j.dsm.2021.11.001 | |
18. | Elise Alfieri, Yann Ferrat, Une meilleure rémunération des mineurs : un effet positif sur la performance financière des cryptomonnaies [Better miner compensation: a positive effect on the financial performance of cryptocurrencies], 2022, Prépublication, 1267-4982, I129, 10.3917/inno.pr2.0129 |
19. | Sahil Sejwal, Kartik Aggarwal, Soumya Ranjan Nayak, Joseph Bamidele Awotunde, 2023, Chapter 22, 978-981-19-2003-5, 251, 10.1007/978-981-19-2004-2_22 | |
20. | Jihed Ben Nouir, Hayet Ben Haj Hamida, How do economic policy uncertainty and geopolitical risk drive Bitcoin volatility?, 2023, 64, 02755319, 101809, 10.1016/j.ribaf.2022.101809 | |
21. | Viviane Naimy, Omar Haddad, Rim El Khoury, 2022, Chapter 30, 978-3-031-04215-7, 347, 10.1007/978-3-031-04216-4_30 | |
22. | Francisco Jareño, María De La O González, Pascual Belmonte, Asymmetric interdependencies between cryptocurrency and commodity markets: the COVID-19 pandemic impact, 2022, 6, 2573-0134, 83, 10.3934/QFE.2022004 | |
23. | Mamoona Zahid, Farhat Iqbal, Dimitrios Koutmos, Forecasting Bitcoin Volatility Using Hybrid GARCH Models with Machine Learning, 2022, 10, 2227-9091, 237, 10.3390/risks10120237 | |
24. | Volkan ETEMAN, Erkan IŞIĞIÇOK, YÜKSEK FREKANSLI KRİPTO VARLIK OYNAKLIĞININ UZUN HAFIZA VE STOKASTİK ÖZELLİKLERİNİN FIGARCH MODELİ İLE İNCELENMESİ [An examination of the long-memory and stochastic properties of high-frequency crypto asset volatility with the FIGARCH model], 2022, 12, 1309-4602, 284, 10.53092/duiibfd.1124966 |
25. | Noureddine Benlagha, Wael Hemrit, Asymmetric determinants of Bitcoin's wild price movements, 2023, 49, 0307-4358, 227, 10.1108/MF-03-2022-0105 | |
26. | Kennard Fung, Jiin Jeong, Javier Pereira, More to cryptos than bitcoin: A GARCH modelling of heterogeneous cryptocurrencies, 2022, 47, 15446123, 102544, 10.1016/j.frl.2021.102544 | |
27. | Sylvia Gottschalk, Digital currency price formation: A production cost perspective, 2022, 6, 2573-0134, 669, 10.3934/QFE.2022030 | |
28. | O.D. Adubisi, A. Abdulkadir, U.A. Farouk, H. Chiroma, The exponentiated half logistic skew-t distribution with GARCH-type volatility models, 2022, 16, 24682276, e01253, 10.1016/j.sciaf.2022.e01253 | |
29. | Zied Ftiti, Wael Louhichi, Hachmi Ben Ameur, Cryptocurrency volatility forecasting: What can we learn from the first wave of the COVID-19 outbreak?, 2021, 0254-5330, 10.1007/s10479-021-04116-x | |
30. | Chuanhai Zhang, Huan Ma, Gideon Bruce Arkorful, Zhe Peng, The impacts of futures trading on volatility and volatility asymmetry of Bitcoin returns, 2023, 86, 10575219, 102497, 10.1016/j.irfa.2023.102497 | |
31. | Xingren Xiang, Jiayuan Shen, Kaixiang Yang, Guoming Zhang, Jiren Qian, Chengyuan Zhu, 2022, Daily natural gas load forecasting based on sequence autocorrelation, 978-1-6654-6536-6, 1452, 10.1109/YAC57282.2022.10023872 | |
32. | Serkan Aras, On improving GARCH volatility forecasts for Bitcoin via a meta-learning approach, 2021, 230, 09507051, 107393, 10.1016/j.knosys.2021.107393 | |
33. | Samuel Asante Gyamerah, Ning Cai, Two-Stage Hybrid Machine Learning Model for High-Frequency Intraday Bitcoin Price Prediction Based on Technical Indicators, Variational Mode Decomposition, and Support Vector Regression, 2021, 2021, 1099-0526, 1, 10.1155/2021/1767708 | |
34. | Samuel Asante Gyamerah, Collins Abaitey, Modelling and forecasting the volatility of bitcoin futures: the role of distributional assumption in GARCH models, 2022, 2, 2769-2140, 321, 10.3934/DSFE.2022016 | |
35. | Yiming Wang, 2023, Cryptocurrency Market Volatility Forecasting, 9781450398046, 43, 10.1145/3584816.3584823 | |
36. | Cagri Ulu, The dynamic relationship between BTC with BIST and NASDAQ indices, 2023, 19, 2719-3454, 113, 10.2478/fiqf-2023-0030 | |
37. | Hachmi Ben Ameur, Zied Ftiti, Waël Louhichi, Interconnectedness of cryptocurrency markets: an intraday analysis of volatility spillovers based on realized volatility decomposition, 2024, 341, 0254-5330, 757, 10.1007/s10479-023-05757-w | |
38. | José Antonio Núñez-Mora, Mario Iván Contreras-Valdez, Roberto Joaquín Santillán-Salgado, Risk Premium of Bitcoin and Ethereum during the COVID-19 and Non-COVID-19 Periods: A High-Frequency Approach, 2023, 11, 2227-7390, 4395, 10.3390/math11204395 | |
39. | İbrahim Korkmaz Kahraman, Dündar Kök, Day-of-the-Week and Month-of-the-Year Effects in the Cryptocurrency Market, 2024, 2149-1658, 10.30798/makuiibf.1387108 | |
40. | Muhammad Abubakr Naeem, Raazia Gul, Muhammad Shafiullah, Sitara Karim, Brian M. Lucey, Tail risk spillovers between Shanghai oil and other markets, 2024, 130, 01409883, 107182, 10.1016/j.eneco.2023.107182 | |
41. | Rasa Bruzgė, Jurgita Černevičienė, Alfreda Šapkauskienė, Aida Mačerinskienė, Saulius Masteika, Kęstutis Driaunys, STYLIZED FACTS, VOLATILITY DYNAMICS AND RISK MEASURES OF CRYPTOCURRENCIES, 2023, 24, 1611-1699, 527, 10.3846/jbem.2023.19118 | |
42. | Hajo Holzmann, Bernhard Klar, Using proxies to improve forecast evaluation, 2023, 17, 1932-6157, 10.1214/22-AOAS1716 | |
43. | Sera Şanlı, Mehmet Balcılar, Mehmet Özmen, Predicting the volatility of Bitcoin returns based on kernel regression, 2023, 0254-5330, 10.1007/s10479-023-05490-4 | |
44. | Prakash Raj, Koushik Bera, N. Selvaraju, A hybrid model for intraday volatility prediction in Bitcoin markets, 2025, 78, 10629408, 102426, 10.1016/j.najef.2025.102426 | |
45. | Chikashi Tsuji, Dual asymmetries in Bitcoin, 2025, 15446123, 107450, 10.1016/j.frl.2025.107450 | |
46. | Chikashi Tsuji, The risk–return trade-off of Bitcoin: Evidence from regime-switching analysis, 2025, 11, 2314-7210, 10.1186/s43093-025-00551-5 | |
47. | Prakash Raj, Koushik Bera, N. Selvaraju, Power of decomposition in volatility forecasting for Bitcoins, 2025, 93, 0927538X, 102839, 10.1016/j.pacfin.2025.102839 |
Cluster 18 counties | Population | Cluster 42 counties | Population |
Yuma County | 207,829 | Los Angeles County | 10,098,052 |
Imperial County | 180,216 | Ventura County | 848,112 |
Riverside County | 2,383,286 | ||
San Bernardino County | 2,135,413 | ||
Population density | 132.77/sq mi | Population density | 1854.96/sq mi |
County FIPS code | County name |
2060 | Bristol Bay Borough |
2164 | Lake and Peninsula Borough |
2231 | Skagway-Yakutat-Angoon Census Area |
2232 | Skagway-Hoonah-Angoon Census Area |
2280 | Wrangell-Petersburg Census Area |
2282 | Yakutat Borough |
2290 | Yukon-Koyukuk Census Area |
6003 | Alpine County |
8005 | Arapahoe County |
12025 | Dade County |
13137 | Habersham County |
15005 | Kalawao County |
28117 | Prentiss County |
29137 | Monroe County |
30113 | Yellowstone National Park |
34027 | Morris County |
38045 | LaMoure County |
39139 | Richland County |
45045 | Georgetown County |
46113 | Shannon County |
48027 | Bell County |
48189 | Hale County |
51019 | Bedford County |
51081 | Greensville County |
51095 | James City County |
51515 | Bedford city |
51530 | Buena Vista city |
51540 | Charlottesville city |
51560 | Clifton Forge city |
51580 | Covington city |
51595 | Emporia city |
51600 | Fairfax city |
51660 | Harrisonburg city |
51678 | Lexington city |
51683 | Manassas city |
51685 | Manassas Park city |
51690 | Martinsville city |
51720 | Norton city |
51770 | Roanoke city |
51775 | Salem city |
51780 | South Boston city |
51790 | Staunton city |
51820 | Waynesboro city |
51840 | Winchester city |
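The codes in the table above are five-digit county FIPS identifiers with leading zeros dropped (e.g. 2060 stands for 02060). A minimal sketch of recovering the state and county parts, assuming the standard 2-digit-state plus 3-digit-county FIPS structure:

```python
def split_fips(code: int) -> tuple[str, str]:
    """Split an integer county FIPS code into (state, county) parts.

    County FIPS codes concatenate a 2-digit state code and a 3-digit
    county code; when stored as integers the leading zero is lost, so
    the code is zero-padded back to 5 digits before slicing.
    """
    s = f"{code:05d}"        # e.g. 2060 -> "02060"
    return s[:2], s[2:]      # (state part, county part)

# 2060 is Bristol Bay Borough: state 02 (Alaska), county 060
print(split_fips(2060))   # ('02', '060')
print(split_fips(51840))  # ('51', '840'), Winchester city, Virginia
```

Zero-padding matters for the Alaska (02), California (06), and similar low-numbered states in the table; sorting or joining on the raw integers silently works, but any string join against padded codes will miss them.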