1. Introduction
Since the emergence of the COVID-19 pandemic around December 2019, the outbreak has snowballed globally [1,2], and there is no clear sign that new confirmed cases and deaths are coming to an end. Although vaccines are being rolled out to curb the spread of the pandemic, mutations of the virus are already under way [3,4,5,6]. While the origin of the pandemic is still under debate [7], many researchers are studying it from different perspectives. These studies can be categorised mainly into three levels: the SARS-CoV-2 genetic level [8], the individual-country level [9,10,11] and the continental level [12,13]. In this study, we focus on the latter two levels, for which many methods and techniques have been proposed. For example, linear and non-linear growth models, together with 2-week-kernel-window regression, have been used to model the exponential growth rate of COVID-19 confirmed cases [14], and these have also been generalised to non-linear modelling of the COVID-19 pandemic [15,16]. Some works focus on predicting the spread of COVID-19 by estimating the lead-lag effects between different countries via a time-warping technique [17], while others use clustering analyses to group countries by epidemiological data such as active cases and active cases per population [18]. In addition, other studies examine the relationship between economic variables and COVID-19 related variables [19,20], although their results show no relation between economic freedom and COVID-19 deaths and no relation between the performance of equity markets and COVID-19 cases and deaths.
In this study, we aim to extract the features of the daily biweekly growth rates of cases and deaths at the national and continental levels. We devise orthonormal bases based on Fourier analysis [21,22] and use the resulting Fourier coefficients as the potential features. At the national level, we import the global time series data and sample 117 countries over 109 days [23,24]. We then calculate the Euclidean distance matrices for the inner products between countries and between days. Based on the distance matrices, we calculate their variabilities to examine the distribution of the data. At the continental level, we import the biweekly changes of cases and deaths for 5 continents as well as for the world, as time series over 447 days. We then calculate their inner products with respect to the temporal frequencies and find the similarities of the extracted features between continents.
At the national level, the biweekly data bear stronger temporal features than spatial features, i.e., as time goes by, the pandemic evolves more in the time dimension than in the space (or country-wise) dimension. Moreover, there is a strong concurrency between the features for biweekly changes of cases and deaths, though there is no clear or stable trend in the extracted features. At the continental level, however, one observes a stable trend in the features of the biweekly changes. In addition, the extracted features of the continents are similar to one another, except for Asia, whose features bear no clear similarity to those of the other continents.
Our approach is based on orthonormal bases, which serve as the potential features for the biweekly changes of cases and deaths. This method is straightforward and easy to comprehend. Its limitations are twofold: the extracted features are based on the hidden frequencies of the dynamical structure, to which it is hard to assign an interpretable meaning, and the fetched data are incomplete because of missing entries in the database. Nevertheless, the results of this study could help one map out the evolutionary features of COVID-19.
2. Method and implementation
Let $\delta:\mathbb{N}\to\{0,1\}$ be the function such that $\delta(n)=0$ (also written $\delta_n=0$) if $n\in 2\mathbb{N}$ and $\delta(n)=1$ if $n\in 2\mathbb{N}+1$. Given a set of data points $D=\{\vec{v}\}\subseteq\mathbb{R}^N$, we would like to decompose each $\vec{v}$ into frequency-based vectors by Fourier analysis. The features of the COVID-19 case and death growth rates are specified by the orthogonal frequency vectors $B_N=\{\vec{f}_{ij}:1\le j\le N\}_{i=1}^{N}$, which are based on Fourier analysis, in particular Fourier series [22], where
● $\vec{f}_{1j}=\sqrt{1/N}$ for all $1\le j\le N$;
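● For any $2\le i\le N-1+\delta_N$, $\vec{f}_{ij}=\sqrt{2/N}\cdot\cos(\pi i j/N)$ if $i\in 2\mathbb{N}$ and $\vec{f}_{ij}=\sqrt{2/N}\cdot\sin(\pi (i-1) j/N)$ if $i\in 2\mathbb{N}+1$, for all $1\le j\le N$;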
● If $N\in 2\mathbb{N}$, then $\vec{f}_{Nj}=\sqrt{1/N}\cdot\cos(j\pi)$ for all $1\le j\le N$.
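We have thus constructed an orthonormal basis $F_N=\{\vec{f}_1,\vec{f}_2,\cdots,\vec{f}_N\}$ as features for $\mathbb{R}^N$, where $\vec{f}_i=(\vec{f}_{i1},\dots,\vec{f}_{iN})$. Each $\vec{v}$ can then be written as $\vec{v}=\sum_{i=1}^{N}\langle\vec{v},\vec{f}_i\rangle\cdot\vec{f}_i$, where $\langle\cdot,\cdot\rangle$ is the inner product. The basis $B_N$ can also be represented by the $N$-by-$N$ matrix
$$F_N=\begin{pmatrix}\vec{f}_{11}&\vec{f}_{12}&\cdots&\vec{f}_{1N}\\ \vec{f}_{21}&\vec{f}_{22}&\cdots&\vec{f}_{2N}\\ \vdots&\vdots&\ddots&\vdots\\ \vec{f}_{N1}&\vec{f}_{N2}&\cdots&\vec{f}_{NN}\end{pmatrix},$$
where each $\vec{f}_{ij}$ is defined in Eq 2.1.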
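Example 1. If $N$ is 5, then the matrix representation of the orthonormal basis $B_5$ is, to three decimals,
$$F_5\approx\begin{pmatrix}0.447&0.447&0.447&0.447&0.447\\ 0.195&-0.512&-0.512&0.195&0.632\\ 0.602&0.372&-0.372&-0.602&0\\ -0.512&0.195&0.195&-0.512&0.632\\ 0.372&-0.602&0.602&-0.372&0\end{pmatrix}$$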
and the representation of a data column vector $\vec{v}=(-3,14,5,8,-12)^{T}$ with respect to $B_5$ is calculated by $F_5\vec{v}=[\langle\vec{v},\vec{f}_i\rangle]_{i=1}^{5}$, which is the column vector (5-by-1 matrix) $(5.367,-16.334,-3.271,-6.434,-9.503)^{T}$.
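The construction can be checked numerically. The following is a minimal Python sketch (the study itself reports using R), assuming that the intermediate features alternate between cosine and sine terms of increasing frequency; this form is an assumption, chosen because it reproduces the coefficients of Example 1 to three decimals. The names `fourier_basis` and `represent` are hypothetical.

```python
import math

def fourier_basis(n):
    """Return an n-by-n real Fourier basis as a list of rows (assumed form)."""
    rows = [[math.sqrt(1.0 / n)] * n]                  # constant feature f_1
    for i in range(2, n + 1):
        if n % 2 == 0 and i == n:                      # alternating feature, even n only
            rows.append([math.sqrt(1.0 / n) * math.cos(j * math.pi)
                         for j in range(1, n + 1)])
        elif i % 2 == 0:                               # cosine feature, frequency i/2
            rows.append([math.sqrt(2.0 / n) * math.cos(math.pi * i * j / n)
                         for j in range(1, n + 1)])
        else:                                          # sine feature, frequency (i-1)/2
            rows.append([math.sqrt(2.0 / n) * math.sin(math.pi * (i - 1) * j / n)
                         for j in range(1, n + 1)])
    return rows

def represent(basis, v):
    """Inner products <v, f_i> of v with every basis row f_i."""
    return [sum(f_j * v_j for f_j, v_j in zip(f, v)) for f in basis]

F5 = fourier_basis(5)
coeffs = represent(F5, [-3, 14, 5, 8, -12])
print([round(c, 3) for c in coeffs])
```

Because the rows are orthonormal, the original vector is recovered by summing each coefficient times its basis row.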
2.1. Data description and handling
Data collection and handling consist of two main parts: one for individual countries (the national level) and the other for individual continents (the continental level). At both levels, we fetch the daily biweekly growth rates of confirmed COVID-19 cases and deaths from Our World in Data [23,24]. We then use R 4.1.0 to handle the data and implement the procedures.
Sampled targets: national. After filtering out non-essential and missing data, the effective sample consists of 117 countries and 109 days, as shown in Results. The days range from December 2020 to June 2021. Although the sampled days are not consecutive (because of the missing data), the biweekly information can still cover such loss. The subsequent temporal and spatial analyses are based on these data.
Sampled targets: continental. For the continental data, we collect data for the world, Africa, Asia, Europe, North America and South America. The sampled days range from March 22nd, 2020 to June 11th, 2021, which gives 447 days in total (a different range from the national level). The subsequent temporal analysis is based on these data; there is no spatial analysis at the continental level because of the limited sample size.
Notations: national. To facilitate the presentation, we introduce some notation. Let the sampled countries be indexed by $i=1,\dots,117$ and the sampled days by $t=1,\dots,109$; the days range from December 3rd, 2020 to May 31st, 2021. Let $c_i(t)$ and $d_i(t)$ be the daily biweekly growth rates of confirmed cases and deaths in country $i$ on day $t$, respectively, i.e.,
where $\text{case}_{i,t}$ and $\text{death}_{i,t}$ denote the total confirmed cases and deaths for country $i$ on day $t$, respectively. We form temporal and spatial vectors by
$c_i=(c_i(1),\dots,c_i(109))$, $d_i=(d_i(1),\dots,d_i(109))$, $v(t)=(c_1(t),\dots,c_{117}(t))$ and $w(t)=(d_1(t),\dots,d_{117}(t))$; that is, the vectors $c_i$ and $d_i$ collect the values over time for a given country, and the vectors $v(t)$ and $w(t)$ collect the values of all countries for a given day.
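To make the quantity concrete, the following Python sketch computes a biweekly growth rate from cumulative totals. It assumes the definition commonly given for the Our World in Data series: the percentage change of the count over the last 14 days relative to the count over the 14 days before that. This definition, the function name `biweekly_growth_rate` and the toy numbers are assumptions for illustration and may differ from the formula used in this study.

```python
def biweekly_growth_rate(total, t):
    """Percent change of the 14-day count ending on day t versus the previous 14 days.

    `total` is a list of cumulative counts indexed by day (assumed input format).
    """
    if t < 28:
        raise ValueError("need at least 28 earlier days of cumulative totals")
    recent = total[t] - total[t - 14]            # new cases in the last 14 days
    previous = total[t - 14] - total[t - 28]     # new cases in the 14 days before
    if previous == 0:
        return None                              # undefined; treated as missing
    return 100.0 * (recent - previous) / previous

# Toy cumulative totals: 10 new cases per day for 14 days, then 20 per day.
total = [0] + [10 * k for k in range(1, 15)] + [140 + 20 * k for k in range(1, 15)]
rate = biweekly_growth_rate(total, 28)           # doubling of the 14-day count
```

Days on which the earlier 14-day count is zero yield no value, which is one way missing entries can arise in such a series.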
Notations: continental. We use analogous notation at the continental level. Let the sampled continents (including the world) be indexed by $j=1,\dots,6$, and let the 447 sampled days range from March 22nd, 2020 to June 11th, 2021. We form the temporal vectors $x_j$ and $y_j$, each of length 447, for confirmed cases and deaths, respectively, by
2.2. Implementation
For any $m$-by-$n$ matrix $A$, we use $\min(A)$ to denote the value $\min\{a_{ij}:1\le i\le m;\,1\le j\le n\}$, and we define $\max(A)$ in the same manner. If $\vec{v}$ is a vector, we define $\min(\vec{v})$ and $\max(\vec{v})$ likewise. The implementation goes as follows:
(1) Extract and trim the source data.
Extraction: national. Extract the daily biweekly growth rates of COVID-19 cases and deaths from the database and trim the data. The trimmed data consist of time series over 109 days for 117 countries, as shown in Table 1, and form two 117-by-109 matrices:
Row $i$ of the two matrices is regarded as the temporal vector $c_i$ and $d_i$, respectively, and column $t$ as the spatial vector $v(t)$ and $w(t)$, respectively.
Extraction: continental. The continental data are collected in two 6-by-447 matrices:
(2) Specify the frequencies (features) for the imported data.
Basis: national. In order to decompose $c_i$ and $d_i$ into fundamental features, we specify $F_{109}$ as the corresponding features, whereas to decompose $v(t)$ and $w(t)$, we specify $F_{117}$ as the corresponding features. The results are presented in Table 2.
Basis: continental. In order to decompose $x_j$ and $y_j$ into fundamental features, we specify $F_{447}$ as the corresponding features.
(3) Compute the sets of the representations with respect to various bases.
Representation: national. The temporal representations of $c_i$ and $d_i$ with respect to $F_{109}$ are calculated by the matrix-vector products $F_{109}c_i$ and $F_{109}d_i$,
and the spatial representations of $v(t)$ and $w(t)$ with respect to $F_{117}$ are calculated by $F_{117}v(t)$ and $F_{117}w(t)$.
The results are presented in Results.
Representation: continental. The temporal representations of $x_j$ and $y_j$ with respect to $F_{447}$ are calculated by $F_{447}x_j$ and $F_{447}y_j$.
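The difference between the temporal and spatial representations is only which slice of the data matrix is projected onto the basis. The Python sketch below illustrates this on a made-up 2-by-2 matrix (2 countries, 2 days) using $F_2$, which follows directly from the stated definition: a constant row and an alternating row, each scaled by $\sqrt{1/2}$. The data values and the names `represent`, `temporal` and `spatial` are hypothetical.

```python
import math

# F_2 from the stated definition: f_1j = sqrt(1/2), f_2j = sqrt(1/2) * cos(j * pi).
S = math.sqrt(0.5)
F2 = [[S, S],
      [S * math.cos(1 * math.pi), S * math.cos(2 * math.pi)]]

def represent(basis, v):
    """Coefficients <v, f_i> of v with respect to each basis row f_i."""
    return [sum(f_j * v_j for f_j, v_j in zip(f, v)) for f in basis]

# Toy growth-rate matrix (made-up numbers): rows are countries, columns are days.
cases = [[10.0, 30.0],
         [-5.0, 15.0]]

# Temporal representations: one coefficient vector per country (row).
temporal = [represent(F2, row) for row in cases]

# Spatial representations: one coefficient vector per day (column).
columns = [list(col) for col in zip(*cases)]
spatial = [represent(F2, col) for col in columns]
```

With the real data, the rows would be projected onto $F_{109}$ and the columns onto $F_{117}$, since the basis size must match the length of the vector being decomposed.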
(4) Compute the Euclidean distance matrices for the representations.
Euclidean: national. The distances between temporal representations with respect to cases and deaths are calculated by
The distances between spatial representations with respect to cases and deaths are calculated by
where $d_E$ is the usual Euclidean distance function. The results are presented in Results.
Euclidean: continental. The distances between temporal representations with respect to cases and deaths are calculated by
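As an illustration of this step, the Python sketch below builds a pairwise Euclidean distance matrix from a list of representation vectors. The input vectors are made up, and the names `euclidean` and `distance_matrix` are hypothetical.

```python
import math

def euclidean(u, v):
    """Usual Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(reps):
    """Symmetric matrix whose (a, b) entry is the distance between reps[a] and reps[b]."""
    n = len(reps)
    return [[euclidean(reps[a], reps[b]) for b in range(n)] for a in range(n)]

# Made-up representations of three countries (or days, or continents).
reps = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = distance_matrix(reps)
```

The resulting matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal carry information.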
(5) Compute the average variability based on the above distance matrices.
Average variability: national. For each country i, the temporal variabilities for confirmed cases and deaths are computed by
and for each day t, the spatial variabilities for confirmed cases and deaths are computed by
The results are presented in Results.
Average variability: continental. For each continent j, the temporal variabilities for confirmed cases and deaths are computed by
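One plausible reading of the average variability of an item is its mean distance to all other items, i.e., the mean of the off-diagonal entries in its row of the distance matrix. The Python sketch below implements this reading; it is an assumption for illustration and may differ from the exact definition used in this study. The name `average_variability` and the toy matrix are hypothetical.

```python
def average_variability(dist):
    """Mean distance from each item to all other items (assumed definition)."""
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two items")
    return [sum(row[b] for b in range(n) if b != a) / (n - 1)
            for a, row in enumerate(dist)]

# Toy symmetric distance matrix for three items.
dist = [[0.0, 5.0, 10.0],
        [5.0, 0.0, 5.0],
        [10.0, 5.0, 0.0]]
var = average_variability(dist)
```

Under this reading, an item with a large value lies far from most of the others, while a small value marks an item close to the bulk of the data.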
(6) Unify the national temporal and spatial variabilities of cases and deaths. For each country $i$ and each day $t$, the unified temporal and spatial variabilities for cases and deaths are defined by
● bvar_case_time[i] = (var_case_time[i] − mn1)/(mx1 − mn1);
● bvar_death_time[i] = (var_death_time[i] − mn2)/(mx2 − mn2);
● bvar_case_space[t] = (var_case_space[t] − mn3)/(mx3 − mn3);
● bvar_death_space[t] = (var_death_space[t] − mn4)/(mx4 − mn4),
where
● mn1 = min(var_case_time); mx1 = max(var_case_time);
● mn2 = min(var_death_time); mx2 = max(var_death_time);
● mn3 = min(var_case_space); mx3 = max(var_case_space);
● mn4 = min(var_death_space); mx4 = max(var_death_space).
The results are shown in Results.
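The unification in step (6) is a min-max scaling that maps each variability vector onto the interval [0, 1]. A minimal Python sketch follows; the helper name `min_max_scale` and the sample values are hypothetical, while `var_case_time` and `bvar_case_time` mirror the identifiers above.

```python
def min_max_scale(values):
    """Map values linearly so that the minimum becomes 0 and the maximum becomes 1."""
    mn, mx = min(values), max(values)
    if mx == mn:
        raise ValueError("all values are equal; scaling is undefined")
    return [(x - mn) / (mx - mn) for x in values]

# Made-up temporal variabilities of cases for three countries.
var_case_time = [7.5, 5.0, 10.0]
bvar_case_time = min_max_scale(var_case_time)
```

The same helper applies unchanged to the other three vectors, which makes the four scaled quantities directly comparable on one plot.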
(7) Unify the temporal representations with respect to continental confirmed cases and deaths by matrices whose $(i,j)$ cells are defined by
where $\sigma_{ij}$ and $\beta_{ij}$ denote the values in the $(i,j)$ cells of IP_cont_cases_time and IP_cont_deaths_time, respectively. The results are visualised by figures in Results.
3. Results
There are two main parts of results shown in this section: national results and continental results.
National results. Based on the method described in Section 2, we identify the temporal and spatial orthonormal frequencies, as shown in Table 2.
The computed inner products at the country level, which serve as the values of the extracted features, for the daily biweekly growth rates of cases and deaths with respect to temporal frequencies are shown in Figure 1. Similarly, the computed inner products at the country level for the daily biweekly growth rates of cases and deaths with respect to spatial frequencies are shown in Figure 2. Their scaled variabilities are plotted in Figure 3.
Continental results. Based on the obtained data, we study and compare the continental features of the daily biweekly growth rates of confirmed cases and deaths for Africa, Asia, Europe, North America, South America and the world. Unlike the national data, which contain missing values, the continental data are complete. We take the samples from March 22nd, 2020 to June 11th, 2021, which gives 447 days for the analysis. The cosine values, which measure the similarities between the representations of the continents, are shown in Table 3. The results of the unified inner products with respect to confirmed cases and deaths are plotted in Figures 4 and 5, respectively.
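The cosine value between two representations is the inner product divided by the product of their Euclidean norms. The Python sketch below shows the computation on made-up vectors; the labels "A", "B", "C" and the function name `cosine_similarity` are hypothetical and do not correspond to the entries of Table 3.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Made-up representations: B is a scaled copy of A, C is orthogonal to A.
reps = {"A": [1.0, 2.0, 2.0], "B": [2.0, 4.0, 4.0], "C": [2.0, -2.0, 1.0]}
sim_ab = cosine_similarity(reps["A"], reps["B"])
sim_ac = cosine_similarity(reps["A"], reps["C"])
```

A value near 1 means two representations weight the frequencies in nearly the same proportions regardless of overall scale, whereas a value near 0 means they share almost no common frequency content.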
Other auxiliary results that support the plotting of the graphs are provided in the Appendix. The names of the 117 sampled countries are provided in Tables A1 and A2. The dates of the sampled days are provided in Figure A1. The tabulated inner products for temporal and spatial frequencies at the national level are provided in Table A3, and those for temporal frequencies at the continental level in Table A4. The Euclidean distance matrices for the temporal and spatial representations with respect to confirmed cases and deaths are tabulated in Table A5, and their average variabilities in Table A6.
Summaries of results. Based on the previous tables and figures, we have the following results.
(1) From Figures 1 and 2, one observes that the temporal features are much more distinct than the spatial features, i.e., if one fixes a day and extracts the features from the spatial frequencies, one obtains less distinct features than when fixing a country and extracting the features from the temporal frequencies. This indicates that SARS-CoV-2 evolves and mutates mainly according to time rather than space.
(2) For individual countries, the features for the biweekly changes of cases are almost concurrent with those of deaths. This indicates that the biweekly changes of cases and deaths share similar features. In some sense, the change in deaths is still in tune with the change in confirmed cases, i.e., there is no substantial change in their relationship.
(3) For individual countries, the extracted features go up and down intermittently and there is no obvious trend. This indicates that the virus is still very versatile and that it is hard to capture fixed features at the country level.
(4) From Figure 3, one observes clear similarities, in terms of variabilities, between the daily biweekly growth rates of cases and deaths under temporal frequencies. Moreover, the overall distribution of the data is not condensed: the labelled countries in the middle are scattered across the whole data set. This indicates that the diversity of the daily biweekly growth rates of cases and deaths across countries is still very high.
(5) From Figure 3, the daily biweekly growth rates of deaths with respect to the spatial frequencies are fairly concentrated. This indicates that the extracted features regarding deaths are stable, i.e., there are clearer and more stable spatial features for the daily biweekly growth rates of deaths.
(6) The individual graphs in Figures 4 and 5 have much the same shape but different scales, with the features for deaths being more pronounced (this is also observed at the country level, as claimed in the first result above). This indicates that there is a very clear trend in the features of the daily biweekly growth rates at the continental level (in stark contrast to the third result above).
(7) From Figures 4 and 5, the higher values of the inner products for the biweekly changes of cases and deaths lie at both endpoints, i.e., at low and high temporal frequencies, for all the continents, except for the biweekly change of deaths in Asia. This indicates that the evolutionary patterns in Asia are very distinct from those of the other continents.
(8) From Table 3, the extracted features are very similar across continents, except for Asia. This echoes the above result.
4. Conclusions and future work
In this study, we identify the features of the daily biweekly growth rates of COVID-19 confirmed cases and deaths via orthonormal bases (features) derived from Fourier analysis. We then analyse the inner products, which represent the levels of the chosen features. The variabilities for each country show that the levels of deaths under spatial frequencies are much more concentrated than the others. The generated results are summarised in Section 3. This study has some limitations, which suggest future improvements:
● The associated meanings of the orthonormal features from Fourier analysis are not yet fully explored;
● We use the Euclidean metric to measure the distances between features, which are then used to calculate the variabilities. The Euclidean metric is noted for its geometric properties, but it may not be the most suitable one in the context of frequencies. One could introduce other metrics and apply machine learning techniques to find the optimal ones.
● In this study, we choose the daily biweekly growth rates of confirmed cases and deaths as our research sources. This is a one-sided story. To obtain a fuller picture of the dynamical features, one could add other variables for comparison.
Acknowledgements
This work is supported by the Humanities and Social Science Research Planning Fund Project under the Ministry of Education of China (No. 20XJAGAT001).
Conflict of interest
No potential conflict of interest was reported by the authors.
Appendix