Modified regression and ANN model for load carrying capacity of corroded reinforced concrete beam

Ashhad Imam; Zaman Abbas Kazmi

doi:10.3934/matersci.2017.5.1140

AIMS Materials Science

2017, Volume 4, Issue 5: 1140-1164. doi: 10.3934/matersci.2017.5.1140

Department of Civil Engineering, SHUATS (Formerly AAI-DU), Allahabad-211007, U.P, India

Received: 11 August 2017 Accepted: 03 November 2017 Published: 08 November 2017

The residual strength of corroded reinforced concrete beams has been studied extensively from both experimental and theoretical perspectives. This article corroborates the findings of Azad et al. (2010) on the residual strength and safety of corroded beams and develops an improved regression model that gives more practical outcomes. The proposed model is further verified against data from past research to minimise the validation error. The study then applies a soft computing technique, Artificial Neural Networks (ANN), to improve the prediction of residual strength. One ANN model is proposed to predict the residual capacity of corroded reinforced concrete beams using the same data as Azad et al. (2010), and the effect of fixed data stratification on model performance is studied. The results of the ANN model are in good agreement with the experimental values. Compared with the results of Azad et al. (2010), the ANN model with fixed data stratification predicts residual strength better in terms of correlation coefficient and error reduction, which supports the reliability of the ANN model for this prediction task.

Citation: Ashhad Imam, Zaman Abbas Kazmi. Modified regression and ANN model for load carrying capacity of corroded reinforced concrete beam[J]. AIMS Materials Science, 2017, 4(5): 1140-1164. doi: 10.3934/matersci.2017.5.1140

Related Papers:

[1]	H. Thomas Banks, Shuhua Hu, Zackary R. Kenz, Hien T. Tran . A comparison of nonlinear filtering approaches in the context of an HIV model. Mathematical Biosciences and Engineering, 2010, 7(2): 213-236. doi: 10.3934/mbe.2010.7.213
[2]	Weijie Wang, Shaoping Wang, Yixuan Geng, Yajing Qiao, Teresa Wu . An OGI model for personalized estimation of glucose and insulin concentration in plasma. Mathematical Biosciences and Engineering, 2021, 18(6): 8499-8523. doi: 10.3934/mbe.2021420
[3]	Sarita Bugalia, Jai Prakash Tripathi, Hao Wang . Estimating the time-dependent effective reproduction number and vaccination rate for COVID-19 in the USA and India. Mathematical Biosciences and Engineering, 2023, 20(3): 4673-4689. doi: 10.3934/mbe.2023216
[4]	Gerasimos G. Rigatos, Efthymia G. Rigatou, Jean Daniel Djida . Change detection in the dynamics of an intracellular protein synthesis model using nonlinear Kalman filtering. Mathematical Biosciences and Engineering, 2015, 12(5): 1017-1035. doi: 10.3934/mbe.2015.12.1017
[5]	Oren Barnea, Rami Yaari, Guy Katriel, Lewi Stone . Modelling seasonal influenza in Israel. Mathematical Biosciences and Engineering, 2011, 8(2): 561-573. doi: 10.3934/mbe.2011.8.561
[6]	Balázs Csutak, Gábor Szederkényi . Robust control and data reconstruction for nonlinear epidemiological models using feedback linearization and state estimation. Mathematical Biosciences and Engineering, 2025, 22(1): 109-137. doi: 10.3934/mbe.2025006
[7]	Damilola Olabode, Jordan Culp, Allison Fisher, Angela Tower, Dylan Hull-Nye, Xueying Wang . Deterministic and stochastic models for the epidemic dynamics of COVID-19 in Wuhan, China. Mathematical Biosciences and Engineering, 2021, 18(1): 950-967. doi: 10.3934/mbe.2021050
[8]	Mamadou L. Diouf, Abderrahman Iggidr, Max O. Souza . Stability and estimation problems related to a stage-structured epidemic model. Mathematical Biosciences and Engineering, 2019, 16(5): 4415-4432. doi: 10.3934/mbe.2019220
[9]	Tianfang Hou, Guijie Lan, Sanling Yuan, Tonghua Zhang . Threshold dynamics of a stochastic SIHR epidemic model of COVID-19 with general population-size dependent contact rate. Mathematical Biosciences and Engineering, 2022, 19(4): 4217-4236. doi: 10.3934/mbe.2022195
[10]	Said G. Nassr, Amal S. Hassan, Rehab Alsultan, Ahmed R. El-Saeed . Acceptance sampling plans for the three-parameter inverted Topp–Leone model. Mathematical Biosciences and Engineering, 2022, 19(12): 13628-13659. doi: 10.3934/mbe.2022636

1. Introduction

The G-quadruplex or G-tetrad (G4) is a thermodynamically stable structural element that forms between clusters/stretches/tracts of Guanine (G) residues (|x|≥3) and may be intra- or inter-molecular ^[1,2,3]. The intervening loops, where applicable, are composed of one or more nucleotides (N∈{A, U, T, G, C}) (Figure 1). G4 is found in DNA (telomeres, double-strand break sites, transcription start sites) and in the untranslated region(s) (5'-UTR, 3'-UTR, introns) of mRNA ^[4,5]. In vivo, G4 may function to preserve the telomeric ends of chromosomes, repress or promote transcription and regulate translation ^[4,5]. The generic representation of an intra-strand G4 may be described as follows:

$\left(\left(\left(G_{t, k}\right)_{t \geq 3}\left(N_{h, k}\right)_{h \geq 1}\right)_{k = 3}\left(\left(G_{t, k}\right)_{t \geq 3}\right)_{k = 1}\right)_{m = 1}$

(Def. 1)

t := Number of guanines per G-rich cluster

h := Number of loop-forming generic intervening nucleotides

k := Cluster index

m := Number of strands

G := Guanine

A := Adenine

T := Thymine

C := Cytosine

N := Any nucleotide
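Def. 1 can be read as a search pattern: three units of a G-tract (t ≥ 3) followed by a loop (h ≥ 1), closed by a fourth G-tract. The sketch below is one possible regular-expression rendering of that reading for a single strand (m = 1). The names `G4_PATTERN` and `find_g4` are hypothetical, and because the loop alphabet includes G, tract boundaries are resolved by the regex engine rather than by any biophysical rule.

```python
import re

# Assumed regex rendering of Def. 1 (m = 1): (G-tract + loop) x 3, then a G-tract.
# Loops are matched lazily so that each one is as short as the pattern allows.
G4_PATTERN = re.compile(r"(?:G{3,}[ACGUT]+?){3}G{3,}")


def find_g4(sequence: str) -> list[str]:
    """Return non-overlapping candidate intra-strand G4 segments."""
    return [match.group(0) for match in G4_PATTERN.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_g4("AAGGGAGGGUGGGCGGGAA"))
```

A sequence with only three G-tracts, such as `GGGAGGGAGGG`, is too short to satisfy the pattern and returns an empty list.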

The high melting temperature (T_m ~60 °C) of G4 implies that the mature quadruplex is stable and refractory to unfolding. This is partly due to stabilizing Hoogsteen (N7^gu1-N2^gu2; O6^gu1-N1^gu2) and reverse Hoogsteen (N7^gu1-N1^gu2; O6^gu1-N2^gu2) hydrogen bonding as well as π-orbital stacking between the purine rings of non-contiguous guanine pairs (gu1, gu2) (Figure 1) ^[6,7]. Additionally, the presence of Adenine residues in the intervening loops, variable loop length (h ~1–30 Mer) and permutation have all been shown to contribute to the stability, and hence the persistence, of the mature quadruplex ^[8,9,10,11].

Figure 1. Definition, delineation and identity of a short translatable G-quadruplex. G-quadruplexes are stable structural elements in DNA/RNA and are characterized by Hoogsteen and reverse Hoogsteen pairing along with π-bond purine stacking of non-contiguous Guanine residues. Here, a short intra-strand G-quadruplex (20≤N≤60; N∈{A, U, G, C}) is modeled in the PCS of the mRNA of a hypothetical gene and is translatable (TG4). The modeled TG4 is represented as a sequence of codons $\left(\left(COD_{q}\right)_{q \in \mathbb{N}}^{L}; COD \in \boldsymbol{COD}\right)$. Whilst an arbitrary Guanine-rich cluster/stretch/tract is represented by suitably scored Guanine-containing codon(s), the selection of codons for the intervening loops is without constraint. Abbreviations: COD, set of vertebrate codons; L, total number of codons used to model the translatable G-quadruplex; N, any generic ribonucleotide; PCS, protein coding segment; TG4, translatable G-quadruplex; q, numerical index of codon.

$\begin{aligned} T_{m} &\propto (\#\text{Adenine}/h) \;\Rightarrow\; T_{m} = \tau \cdot (\#\text{Adenine}/h) \\ T_{m} &:= \text{Melting temperature} \\ \tau &:= \text{Constant of proportionality} \\ h &:= \text{Length of intervening loops} \end{aligned}$

(1)

Despite the wide range of methods available that can predict G4 formation in DNA/RNA, there is poor agreement between sequence-based motif locators and empirically derived biophysical data ^[12,13,14,15]. Motif-independent methods such as those that directly measure the GC-content or the GC-/AT-skew of a query sequence and utilize this data to train machine learning algorithms may address some of these discrepancies ^[16,17,18].

Investigations into transcribed RNA suggest that secondary and tertiary forms (5'- and 3'-UTRs) may not only coexist with stretches of unfolded ribonucleotides, but can also be read by the ribosomal machinery. Non-canonical translation is described as: a) translation from atypical start sites AUG→{CUG, GUG} or b) translation of short peptides (≤100 aa), i.e., short open reading frame (sORF)-encoded polypeptides (SEPs) and upstream open reading frames (uORFs) ^[19,20,21,22]. The latter are rarely silent and can function as modulators of metabolism (S-Adenosylmethionine decarboxylase, AMD1) or transcription (activating transcription factor, ATF4, H19; yeast AP-1 like, YAP1) and as generic transcription factors (general control protein, GCN4) ^[19]. G4 has also been observed in one or more exons of the prion protein (PRNP, exon 2), zinc finger protein (ZNF669, exon 1), β-amyloid secretase (BACE1, exon 3) and the estrogen receptor 1 (ESR1, exon 4) among several others ^[16,23,24,25,26,27,28,29].

Whilst the presence of segments of folded mRNA may have a significant influence on the yield of the protein product(s), the effect on sequence when these segments are part of the protein coding segment (PCS) is largely unknown ^[4,5,30,31,32,33]. Proteopathies are diseases that result directly from aggregates of truncated and misfolded proteins. These may occur secondary to a faulty translation machinery such as a ribosome that has stalled on encountering a secondary or tertiary folded mRNA sub segment. Recent data suggest that ~45% of the human genome may code for proteins that are either intrinsically disordered (IDPs) or comprise one or more sub-segments that are disordered (IDRs) ^[34]. The absence of delineable structural features notwithstanding, disordered regions are characterized by short linear motifs (SLiMS) and/or molecular recognition features (MoRFs) ^[34,35]. The improper folding and heightened degradation rates could lead to perturbed proteostasis and hence contribute to the pathogenesis of proteopathies ^[34,35]. Primary proteopathies are likely to result directly from mutations (point, chromosomal translocations) in the PCS of a gene. These include sickle cell disease (β^E6→V6-mediated defective polymerization), amylin-based type Ⅱ Diabetes Mellitus, Cystic Fibrosis (cystic fibrosis transmembrane conductance regulator), Alzheimer's disease (Amyloid β-peptide) and Parkinson's disease (α-synuclein) ^[36,37]. Secondary proteopathies, in contrast, result from motif or molecular mimicry of a host protein(s) by a pathogen. These are further classified into acute and chronic variants depending on the onset, genesis and/or resolution of the resultant infection or infestation ^[34,35].

G4 is known to stall the ribosome during translation, and the resultant protein is truncated and/or degraded at an accelerated rate. This manuscript assumes ribosomal read-through of mRNA with a G-quadruplex and assesses the influence of the translated product on proteostasis. Here, I present a mathematical model of a short G4 (20–60 Mer) in the PCS, i.e., translatable G-quadruplex (TG4), in the mRNA of a hypothetical gene. The mapping uses several novel indices to annotate, classify and select suitable Guanine-containing codons (α) and amino acids (β). A generic algorithm then computes and validates, as proof-of-principle, possible peptides (pTG4_ij) that correspond to the modeled TG4 (pTG4_ij∈PTG4~TG4). Co-occurrence, homology and the distribution of overlapping/shared amino acids between PTG4 and the disorder-promoting SLiMS are used to infer probable mechanisms of TG4~PTG4 facilitated misfolding. Standard bioinformatics indices (accuracy, precision, recall, p-value) are used to arrive at these conclusions.

2. Materials and methods

2.1. Mathematical expression for the canonical peptidome of a short translatable G-quadruplex (PTG4)

The objective of this investigation is to model a short G4 in an arbitrary PCS (TG4) which, when translated, will result in a set of peptides (PTG4) with an average length of less than 100 amino acids. The hypothesis explored in this manuscript is that, in the event of a ribosomal read-through, the translated mRNA with its G4 will result in a modified protein product. This protein will then exhibit considerable propensity to misfold on account of the presence of one or more members of the PTG4.

2.1.1. Model of a translatable G-quadruplex (TG4)

SEPs-derived peptides with the lowest molecular weight (~2.5 kDa) and with lengths varying from ~7–20 aa were identified and used to define the boundaries of the peptides that comprise PTG4 ^[20,21]. The TG4 (m = 1) is therefore modeled as an intra-strand sub sequence of the mRNA of a hypothetical gene and has a length of ~20–60 Mer. This is represented (with symbols and variable names as explained in Def. 1) as follows:

$T G 4: = \left(\left(\left(G_{t, k}\right)_{3 \leq t \leq 9}\left(N_{h, k}\right)_{2 \leq h \leq 7}\right)_{k = 3}\left(\left(G_{t, k}\right)_{3 \leq t \leq 9}\right)_{k = 1}\right)_{m = 1}$

(Def. 2)

2.1.2. Codon association as a suitable representation of the TG4

Since the Guanine-rich clusters and loops are contiguous, the aforementioned model (Def. 2) of the TG4 may be approximated with a sequence of codons as follows:

$T G 4: = \left(C O D_{q}\right)_{q \in \mathbb{N}}^{L} | C O D \in \boldsymbol{C O D}$

(Def. 3)

The algorithm to compute L, the number of codons needed to model TG4, is as follows:

$\begin{array}{lr} {1:}& \quad N \leftarrow\{u \in[20, 60)\}\\ {2:} & {r \leftarrow N ~mod ~3} \\ {3:} & {e \leftarrow N-((N ~mod ~3) / 3)}\\ {4:}& \quad If~ e \lt (\lfloor e\rfloor+\lceil e\rceil) / 2 ~then\\ {5:}& \quad L = \lceil e / 3\rceil\\ {6:}& \quad else ~If~ e \geq(\lfloor e\rfloor+\lceil e\rceil) / 2 ~then\\ {7:}& \quad L = \lfloor e / 3\rfloor\\ {8:}& \quad end~ I f \end{array}$

N := Number of ribonucleotides required to model TG4

L := Total number of codons needed to model TG4 (7 ≤ L < 21)

q := qth codon

r := Remainder = {0, 1, 2}

e, u := Generic variables

COD := Set of vertebrate codons
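The eight-step procedure above can be transcribed almost literally. The following is a minimal sketch, assuming exact rational arithmetic (via `fractions.Fraction`) is an acceptable stand-in for the real-valued variable e; the function name `codon_count` is hypothetical. In effect the procedure rounds N/3 to the nearest whole codon.

```python
from fractions import Fraction
from math import ceil, floor


def codon_count(n: int) -> int:
    """Number of codons L needed to model a TG4 of n ribonucleotides."""
    if not 20 <= n < 60:                              # step 1: N in [20, 60)
        raise ValueError("n must satisfy 20 <= n < 60")
    r = n % 3                                         # step 2: remainder
    e = Fraction(n) - Fraction(r, 3)                  # step 3: e = N - r/3
    midpoint = Fraction(floor(e) + ceil(e), 2)
    if e < midpoint:                                  # steps 4-5
        return ceil(e / 3)
    return floor(e / 3)                               # steps 6-7


if __name__ == "__main__":
    for n in (20, 21, 22, 23, 59):
        print(n, codon_count(n))
```

Over the whole admissible range the result stays within 7 ≤ L < 21, which agrees with the bound stated in the definitions.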

2.1.3. Codon classification and amino acid composition of PTG4

The codons selected for modelling TG4 (Def. 3) comprised suitably scored Guanine-containing vertebrate codons $\left(\boldsymbol{g C O D}_{\boldsymbol{n}}^{+} \subset \boldsymbol{C O D}\right)$ for the Guanine-rich clusters/stretches/tracts (3≤t≤9; Defs. 1 and 2) and generic/no-stop codons for the intervening loops (Figures 2 and 3). Briefly, a Guanine-containing codon (gCOD_n) is scored by considering its association with two similar flanking codons, i.e., gCOD_n-1, gCOD_n, gCOD_n+1, such that there is at least one occurrence of 'GGGG' (δ≥1.0) (Figures 2 and 3). This non-trivial case (4≤t≤9) is chosen since its trivial equivalent, t = 3, is already subsumed (Defs. 1 and 2). Numerically,

Figure 2. Algorithm to delineate and assess relevance of the peptidome of the modeled translatable G-quadruplex. Sub sections 2.1.1–4 are devoted to constructing the model of a short translatable G-quadruplex in the PCS of the mRNA of a hypothetical gene. Briefly, Guanine-containing vertebrate codons are scored and selected using codon association as the underlying model. An amino acid is then scored on the basis of the proportion of the G4 favoring codons it possesses. This schema is deployed iteratively and results in the complete set of peptides for the modeled translatable G-quadruplex. Sub section 2.1.5 and 2.1.6 are used to assess the relevance of the predicted peptidome to the genesis of misfolding induced proteostasis. This is done by examining the co-occurrence, homology and distribution of overlapping/shared amino acids of members of this peptidome with one or more short linear motifs. The sequences utilized are length adjusted empirically determined disordered regions, full length protein sequences with disordered segments and taxonomically diverse generic protein sequences. Abbreviations: G4, G-quadruplex; mRNA, messenger ribonucleic acid; PCS, protein coding segment; PTG4, hypothetical peptidome of the modeled short translatable G-quadruplex; SLiMS, short linear motifs.

Figure 3. Schema to characterize a short translatable G-quadruplex (TG4) and its associated peptidome (PTG4). The TG4 is modeled as a short sequence (L = 20) of Guanine-containing codons $\left(\left(COD_{q}\right)_{q = 7}^{q = 20}; COD \in \boldsymbol{COD}\right)$. Guanine-containing vertebrate codons (COD) are scored $\left(\alpha_{codon}^{amino}>0.00\right)$ and selected $\left(\boldsymbol{gCOD}_{\boldsymbol{amino}}^{+} \subset \boldsymbol{COD}\right)$ using codon association as the underlying model. This partitions vertebrate codons into permissive $\left(\boldsymbol{GGG, GxG, xGG, GGx, xxG, Gxx}; \alpha_{codon}^{amino}>0.0000; n = 28\right)$ and non-permissive $\left(\boldsymbol{xGx, xxx}; \alpha_{codon}^{amino} = 0.0000; n = 36\right)$ codons. Interestingly, the amber stop codon (UAG; α = 1.0022) is also selected by these criteria. An amino acid is then scored on the basis of the proportion of the G4-favouring codons amongst all the codons that code for it $\left(\beta_{amino} = \left|\boldsymbol{gCOD}_{\boldsymbol{amino}}^{+}\right| /\left|\boldsymbol{COD}_{\boldsymbol{amino}}\right|\right)$. An interesting subset of these amino acids (β = 1.00) comprises those which are encoded entirely by codons which may favour G4 formation. These include Valine, Alanine, Aspartic- and Glutamic-acids, Methionine and Glycine. Abbreviations: COD, set of vertebrate codons; L, length of codon model of TG4; q, numerical indices; x, generic ribonucleotide.

$\alpha_{codon}^{amino} = \gamma \cdot \theta \cdot \delta+\Omega$

(2)

γ := Probability of codon occurrence (γ = 1/64 ≈ 0.02)

θ := Probability of codon occurrence within a group (θ = {0.04, 0.11, 0.33, 1})

δ := Distinct occurrences of 'GGGG' (δ = {0, 1, 2, 6})

Ω := Number of adjacent codons with δ (Ω = {0, 1, 2})
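As a worked check of Eq. (2), the sketch below evaluates α for the (θ, δ, Ω) combinations that appear in Table 1, using the rounded value γ = 0.02 that the table uses. The function name `alpha` and the dictionary `TABLE_1_PARAMETERS` are hypothetical labels introduced for illustration only.

```python
GAMMA = 0.02  # rounded value of 1/64, as used in Table 1


def alpha(theta: float, delta: int, omega: int, gamma: float = GAMMA) -> float:
    """Codon score from Eq. (2): alpha = gamma * theta * delta + omega."""
    return gamma * theta * delta + omega


# (theta, delta, omega) per codon set, copied from Table 1.
TABLE_1_PARAMETERS = {
    "GGG": (1.00, 6, 2),
    "GxG": (0.33, 1, 2),
    "xGG/GGx": (0.33, 2, 1),
    "xxG/Gxx": (0.11, 1, 1),
    "xGx": (0.11, 0, 0),
    "xxx": (0.04, 0, 0),
}

if __name__ == "__main__":
    for codon_set, parameters in TABLE_1_PARAMETERS.items():
        print(f"{codon_set:8s} alpha = {alpha(*parameters):.4f}")
```

Because Ω is an integer offset and the product γ·θ·δ is small, the score is dominated by Ω, and the product term only separates codon sets that share the same Ω.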

Since the genetic code is degenerate, amino acids mapped from the selected codons are further scored and grouped (g1, g2, g3) (Figures 3 and 4).

Figure 4. Dissecting the peptidome of the modeled short translatable G-quadruplex. The schema (α, β) outlined in this work is used to construct a peptidome of the modeled short translatable G-quadruplex (PTG4). This peptidome is the finite union of several peptides (pTG4_ij∈PTG4|7≤i≤20, 1≤j≤J). Each such peptide has a subset of amino acids that corresponds to an arbitrary Guanine-rich cluster/stretch/tract and another for the intervening loops. Whilst the former has a restricted composition of amino acids (β > 0; y∈Y), the latter is generic (β≥0; z∈Z). The manner in which this predicted peptidome may influence the genesis of primary and secondary misfolding-induced proteostasis is next investigated. This is done by examining the co-occurrence of pTG4_ij with one or more SLiMS_w (1≤w≤3) in empirically determined disordered regions and full-length protein sequences with disordered segments. These studies are complemented by investigating the distribution of sequences with shared amino acids ((z_n)_n≥2|z∈pTG4_ij∩SLiMS_w, pTG4_ij∈PTG4; SLiMS_w∈SLiMS) across taxa. Interestingly, most of the amino acids that comprise PTG4 include those that favour hyperphosphorylation (Serine, Threonine), non-covalent complex formation (Alanine, Valine, Leucine, Isoleucine, Lysine, Arginine, Aspartic- and Glutamic-acids) and proteolytic cleavage. Abbreviations: g1, g2, g3, classification of amino acids used in this study; PTG4, hypothetical peptidome of the modeled short translatable G-quadruplex; SLiMS, short linear motifs; Y, Z, sets of amino acids.

$\beta_{amino} = \frac{\left|\boldsymbol{gCOD}_{\boldsymbol{amino}}^{+}\right|}{\left|\boldsymbol{COD}_{\boldsymbol{amino}}\right|}$

(3)

$\boldsymbol{gCOD}_{\boldsymbol{amino}}^{+} :=$ Set of optimal codons for each amino acid $\left(\alpha_{codon}^{amino}>0.0000\right)$

$\boldsymbol{COD}_{\boldsymbol{amino}} :=$ Set of codons for each amino acid
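The β score can be reproduced from a codon table. The sketch below assumes the standard genetic code (which matches the codon-to-amino-acid assignments listed in Table 1) and assumes that a codon is G4-favouring (α > 0) exactly when it carries a Guanine in its first or third position, which is the pattern shared by the Rank 1–4 codon sets. The names `GENETIC_CODE`, `is_permissive` and `beta_scores` are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def is_permissive(codon: str) -> bool:
    """Assumed rule for alpha > 0: Guanine at the first or third position."""
    return codon[0] == "G" or codon[2] == "G"


def beta_scores() -> dict[str, Fraction]:
    """beta = |gCOD+_amino| / |COD_amino| for each amino acid (stops excluded)."""
    total: dict[str, int] = {}
    favoured: dict[str, int] = {}
    for codon, aa in GENETIC_CODE.items():
        if aa == "*":
            continue
        total[aa] = total.get(aa, 0) + 1
        favoured[aa] = favoured.get(aa, 0) + int(is_permissive(codon))
    return {aa: Fraction(favoured[aa], total[aa]) for aa in total}


if __name__ == "__main__":
    for aa, score in sorted(beta_scores().items(), key=lambda item: -item[1]):
        print(aa, score)
```

Grouping the result by β = 1, 0 < β < 1 and β = 0 gives three classes of 7, 7 and 6 amino acids, the same sizes as groups g1, g2 and g3.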

2.1.4. Characterize the peptidome corresponding to TG4 (PTG4)

Whilst amino acids from groups 1 and 2 (β > 0.00; Eq. 3) can represent the modeled G-rich clusters (y∈g1∪g2 = Y), no constraint was imposed on the amino acids (z) used to model the loops (z∈g1∪g2∪g3 = Z) (Figures 3 and 4). The peptidome (PTG4) evaluated by this study is a combinatorial association of peptides such that the molecular weight is ~0.8–2.3 kDa and the length of any arbitrary member is ~7–20 aa (Figure 4). This may be represented as follows:

$p T G 4_{i j} = \left(\left(\left(\left(y_{i, k}\right)_{1 \leq i \leq 3}\left(z_{i, k}\right)_{1 \leq i \leq 2}\right)_{k = 3}\left(y_{i}\right)_{1 \leq i \leq 3}\right)\left(z_{i}\right)_{1 \leq i \leq 2}\right)_{j}$

(Def. 4)

$\boldsymbol{P T G} {\bf 4} = \bigcup _{i = 7}^{i = 20} \bigcup _{j = 1}^{j = J}\left|p T G 4_{i j}\right|$

(Def. 5)

PTG4 := Peptidome corresponding to TG4

pTG4_ij := jth canonical amino acid form of PTG4 with "i" amino acids

i := Number of amino acids that comprise the modelled PTG4

J := Maximum number of canonical pTG4 for "i" amino acids

2.1.5. Establish proof-of-principle of biological relevance of PTG4

A dataset that comprises experimentally validated G4-forming mRNA segments of several genes (n = 99) was downloaded (http://scottgroup.med.usherbrooke.ca/G4RNA/) and used to investigate the distribution of G4 ^[16]. Genes which possess non-redundant RNA (R) sub sequences in the PCS are translated in 6 reading frames using an online tool (http://web.expasy.org/translate). The peptides generated are classified as those: ⅰ) with one or more uninterrupted stretches of N-terminal amino acids of length ≥7 aa (~A), ⅱ) with an in-frame termination signal designated as 'STOP' (~B) and ⅲ) without any termination signal, i.e., absence of a 'STOP' in their sequence (~C). The translated peptides are classified as "VALID" ((B∩A)∪(C∩A)) and then queried for matches with $p T G 4_{i j}(7 \leq i \leq 20, j \in \mathbb{N})$ . The PERL scripts required to parse and process the resulting data files were developed in house, and the corresponding pseudocode is presented as additional information (Pseudocode, PS1: Supplementary Text 1).
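The translation and classification steps above can be sketched as follows. This is not the in-house PERL pipeline; it is an illustrative Python sketch that assumes the standard genetic code, takes the "N-terminal stretch" to mean the residues before the first in-frame stop, and uses hypothetical names (`six_frames`, `classify`). Since classes B and C are complementary, the "VALID" condition (B∩A)∪(C∩A) reduces to membership of A.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(rna: str) -> str:
    """Translate complete codons only; '*' marks a STOP."""
    return "".join(CODE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def six_frames(sequence: str) -> list[str]:
    """Three forward and three reverse-complement reading frames."""
    rna = sequence.upper().replace("T", "U")
    reverse = rna.translate(COMPLEMENT)[::-1]
    return [translate(s[offset:]) for s in (rna, reverse) for offset in range(3)]


def classify(peptide: str, min_len: int = 7) -> dict[str, bool]:
    """Assign classes A, B, C and the derived VALID flag to one frame."""
    has_stop = "*" in peptide
    n_terminal = peptide.split("*")[0]   # assumption: stretch before first STOP
    in_a = len(n_terminal) >= min_len
    return {"A": in_a, "B": has_stop, "C": not has_stop, "VALID": in_a}


if __name__ == "__main__":
    for frame in six_frames("GGG" * 7):
        print(frame, classify(frame))
```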

2.1.6. In silico assessment of PTG4 to misfolding induced proteostasis

This is done by examining the occurrence of PTG4 in amino acid/protein sequences of disordered regions (IDRs) and full-length proteins with disordered regions (IDPs). DisProt 7.0 (http://disprot.org) is a database of experimentally validated and non-redundant sequences of IDRs and IDPs ^[38]. The sequences (|IDR| = 1445; |IDP| = 800) that comprise these are queried for occurrences of $p T G 4_{i j}(7 \leq i \leq 20, j \in \mathbb{N})$ (Supplementary Texts 2 and 3). A preliminary partitioning schema divides these datasets into two distinct subsets, i.e., #pTG4_ij≥1 (PT⁺≡PPOS⊂{IDR, IDP}; Def. 6) and #pTG4_ij = 0 (PT⁻≡PNEG⊂{IDR, IDP}; Def. 7). The extent of co-occurrence of one or more SLiMS_w≡SL (w = {1, 2, 3}) with pTG4_ij (SL^±∈{PPOS, PNEG}) (Defs. 8 and 9) is then evaluated to infer the relevance of PTG4 to misfolding-induced proteostasis. The distribution of overlapping/shared sequences of amino acids ((z_n)_n≥2∈(pTG4_ij∩SLiMS_w); z_n∈Z) (Def. 10) is examined in protein sequences from taxonomically diverse organisms with ScanProsite (https://prosite.expasy.org/scanprosite). The reasoning behind this rationale is as follows:

$\begin{aligned} \left(z_{n}\right)_{n \geq 2} & \in\left(p T G 4_{i j} \cap S L i M S_{w}\right) \\ & = \left(\left(z_{n}\right)_{n \geq 2} \in p T G 4_{i j}\right) \cap\left(\left(z_{n}\right)_{n \geq 2} \in S L i M S_{w}\right) \\ & \text{Let } z_{n} = z_{n}^{\prime} \text{ and } z_{n} = z_{n}^{\prime \prime}. \text{ Rewriting,} \\ & = \left(\left(z_{n}^{\prime}\right)_{n \geq 2} \in p T G 4_{i j}\right) \cap\left(\left(z_{n}^{\prime \prime}\right)_{n \geq 2} \in S L i M S_{w}\right) \\ & = \left(\left(z_{n}^{\prime}\right)_{n \geq 2}, \left(z_{n}^{\prime \prime}\right)_{n \geq 2}\right) \\ & = p T G 4_{i j} \times S L i M S_{w} \end{aligned}$

Z := Set of amino acids (z_n ∈ Z)

pTG4_ij := Canonical amino acid form of PTG4

SLiMS := Set of short linear motifs (SLiMS_w ∈ SLiMS)

i, j, n, w := Indices of members of Z, PTG4, SLiMS
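To make the notion of a shared sequence (z_n)_{n≥2} concrete, the sketch below lists every contiguous run of at least two residues that occurs in both a peptide and a motif. It assumes that both are given as plain one-letter amino acid strings, whereas real SLiMS are often written as patterns; the function name `shared_segments` and the example strings are hypothetical.

```python
def shared_segments(peptide: str, motif: str, min_len: int = 2) -> set[str]:
    """Contiguous residue runs (length >= min_len) common to both strings."""

    def segments(sequence: str) -> set[str]:
        return {
            sequence[start:end]
            for start in range(len(sequence))
            for end in range(start + min_len, len(sequence) + 1)
        }

    return segments(peptide) & segments(motif)


if __name__ == "__main__":
    # Hypothetical peptide and motif, for illustration only.
    print(sorted(shared_segments("GAGVGLG", "AGV")))
```

Each shared segment found this way could then be used as a query pattern when scanning protein sequences across taxa.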

2.2. Statistical measures to compute and assess biological relevance of TG4

The indices utilized by this study to establish relevance of matched instances of various motifs/co-motifs in the peptide/protein sequences of interest include the accuracy (A), precision (P), recall (R) and the p-value. A 2×2 table which represents the categorized data (2.1.4) is constructed and used to compute these indices. Its cells are defined as follows:

TN := True negative (|SNEG ∩ PNEG| = |SNEG|) ≡ SL⁻PT⁻ (Def. 11)

FP := False positive (|SNEG ∩ PPOS| = |SNEG|) ≡ SL⁻PT⁺ (Def. 12)

FN := False negative (|SPOS ∩ PNEG| = |SPOS|) ≡ SL⁺PT⁻ (Def. 13)

TP := True positive (|SPOS ∩ PPOS| = |SPOS|) ≡ SL⁺PT⁺ (Def. 14)

The equations may then be written as:

$A = \frac{TN+TP}{TN+FP+FN+TP} \times 100$

(4)

$P = \frac{TP}{FP+TP} \times 100$

(5)

$R = \frac{TP}{FN+TP} \times 100$

(6)
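Equations (4) to (6) can be checked against a row of the co-occurrence table. The sketch below uses the SLiMS₂ counts reported for the disordered regions in Table 4 (SL⁻PT⁻ = 749, SL⁻PT⁺ = 18, SL⁺PT⁻ = 121, SL⁺PT⁺ = 29); the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> tuple[float, float, float]:
    """Accuracy, precision and recall in percent, following Eqs. (4)-(6)."""
    accuracy = (tn + tp) / (tn + fp + fn + tp) * 100
    precision = tp / (fp + tp) * 100
    recall = tp / (fn + tp) * 100
    return accuracy, precision, recall


if __name__ == "__main__":
    # SLiMS2 row for IDRs, as reported in Table 4.
    a, p, r = confusion_metrics(tn=749, fp=18, fn=121, tp=29)
    print(f"A = {a:.2f}%, P = {p:.2f}%, R = {r:.2f}%")
    # prints: A = 84.84%, P = 61.70%, R = 19.33%
```

The printed values agree with the A, P and R columns of that row. The marginal totals in the same table follow from the same four cells (for example, R₁T = TN + FP and C₂T = FP + TP).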

The p-values for these analyses are computed by comparing the frequency of occurrence of all pTG4_ij in a test sequence (ϕ_{pTG4_ij}) with the same in randomly-generated (v∈V) sequences of similar lengths (ϕ_{pTG4_vij}), i.e., 7-50 aa (1≤v≤10000) and > 50 aa (1≤v≤100000) (Pseudocode, PS2: Supplementary Text 1):

$\begin{aligned} p-{value} & = \phi_{p T G 4_{v i j}} / \phi_{p T G 4_{i j}} \\ & = \left(\sum\limits_{v = 1}^{v = |V|} \sum\limits_{i = 7}^{i = 21} \sum\limits_{j = 1}^{j = J} p T G 4_{v i j}\right) /\left(\sum\limits_{i = 7}^{i = 21} \sum\limits_{j = 1}^{j = J} p T G 4_{i j}\right) \\ & = \left(\sum\limits_{v = 1}^{v = |V|} \sum\limits_{i = 7}^{i = 21} \sum\limits_{j = 1}^{j = J} p T G 4_{v i j} / \sum\limits_{i = 7}^{i = 21} \sum\limits_{j = 1}^{j = J} p T G 4_{i j}\right) \end{aligned}$

(7)

The frequency of occurrence of overlapping sequences of amino acids ((z_n)_n≥2∈(pTG4_ij∩SLiMS_w); z_n∈Z) in pre-compiled and curated protein sequences (ϕ_{(z_n)}) across taxa is compared with randomly chosen sequences of comparable lengths (ϕ_{(vz_n)}; n = 5000). These are used to estimate statistical significance, i.e., p-value = ϕ_{(vz_n)}/ϕ_{(z_n)} (8).

3. Results

This section describes the implementation of a model of short intra-strand TG4 for various values of α and β, populates PTG4 and establishes the equivalence TG4~PTG4. Co-occurrence and homology studies between PTG4 and the SLiMS in IDRs/IDPs and generic protein sequences across taxa are used to infer probable mechanisms of TG4~PTG4 facilitated misfolding-induced proteostasis.

3.1. Suitability of Guanine-containing codons as a model for an arbitrary G-rich cluster of TG4

An association-competent codon not only takes into account the presence of a Guanine residue, but also gives weightage to its position (Figures 1–3, Table 1). This schema partitions standard vertebrate codons into those with a high (Ranks 1–4; α > 0.0000) or low (Rank 5; α = 0.0000) propensity to form a contiguous cluster of Guanine residues (Figures 3 and 4, Table 1). Whilst 'GGG' (Rank 1; α = 2.1200) can associate with {GGG, GxG, xGG, GGx, Gxx, xxG} bilaterally (δ = 6; Ω = 2), 'GxG' (Rank 2; α = 2.0066) can do so only with 'GGG' (δ = 1; Ω = 2). On the other hand, the codon subsets 'GGx' and 'xGG' (Rank 3; α = 1.0132) can form two clusters of contiguous Guanine residues with 'GGG' and 'xGG'/'GGx' unilaterally (δ = 2; Ω = 1). Similarly, the subsets 'xxG' or 'Gxx' (Rank 4; α = 1.0022) can form contiguous Guanines with a single occurrence of 'GGG' (δ = 1; Ω = 1) (Figures 3 and 4, Table 1). Conversely, codons with either a single occurrence of a central Guanine residue 'xGx' or no Guanine residues 'xxx' (Rank 5; α = 0.0000) are unable to form 'GGGG' and are excluded from this study (Figures 3 and 4, Table 1).

Table 1. Rank wise arrangement of codon scores for the non-trivial (4≤|G|≤9) TG4.

Rank	Codon set, Cardinality	Codon	γ	θ	δ	Ω	α=γ.θ.δ+Ω	aa
1	GGG, 1	GGG	0.02	1.00	6	2	2.1200	Gly
2	GxG, 3	GUG	0.02	0.33	1	2	2.0066	Val
		GCG	0.02	0.33	1	2	2.0066	Ala
		GAG	0.02	0.33	1	2	2.0066	Glu
3	xGG, 3	UGG	0.02	0.33	2	1	1.0132	Trp
		CGG	0.02	0.33	2	1	1.0132	Arg
		AGG	0.02	0.33	2	1	1.0132	Arg
3	GGx, 3	GGU	0.02	0.33	2	1	1.0132	Gly
		GGC	0.02	0.33	2	1	1.0132	Gly
		GGA	0.02	0.33	2	1	1.0132	Gly
4	xxG, 9	UUG	0.02	0.11	1	1	1.0022	Leu
		UCG	0.02	0.11	1	1	1.0022	Ser
		UAG	0.02	0.11	1	1	1.0022	Ter
		CUG	0.02	0.11	1	1	1.0022	Leu
		CCG	0.02	0.11	1	1	1.0022	Pro
		CAG	0.02	0.11	1	1	1.0022	Gln
		AUG	0.02	0.11	1	1	1.0022	Met
		ACG	0.02	0.11	1	1	1.0022	Thr
		AAG	0.02	0.11	1	1	1.0022	Lys
4	Gxx, 9	GUU	0.02	0.11	1	1	1.0022	Val
		GCU	0.02	0.11	1	1	1.0022	Ala
		GAU	0.02	0.11	1	1	1.0022	Asp
		GUC	0.02	0.11	1	1	1.0022	Val
		GCC	0.02	0.11	1	1	1.0022	Ala
		GAC	0.02	0.11	1	1	1.0022	Asp
		GUA	0.02	0.11	1	1	1.0022	Val
		GCA	0.02	0.11	1	1	1.0022	Ala
		GAA	0.02	0.11	1	1	1.0022	Glu
5	xGx, 9	UGU	0.02	0.11	0	0	0.0000	Cys
		UGC	0.02	0.11	0	0	0.0000	Cys
		UGA	0.02	0.11	0	0	0.0000	Ter
		CGU	0.02	0.11	0	0	0.0000	Arg
		CGC	0.02	0.11	0	0	0.0000	Arg
		CGA	0.02	0.11	0	0	0.0000	Arg
		AGU	0.02	0.11	0	0	0.0000	Ser
		AGC	0.02	0.11	0	0	0.0000	Ser
		AGA	0.02	0.11	0	0	0.0000	Arg
5	xxx, 27	UUU	0.02	0.04	0	0	0.0000	Phe
		UCU	0.02	0.04	0	0	0.0000	Ser
		UAU	0.02	0.04	0	0	0.0000	Tyr
		UUC	0.02	0.04	0	0	0.0000	Phe
		UCC	0.02	0.04	0	0	0.0000	Ser
		UUA	0.02	0.04	0	0	0.0000	Leu
		UCA	0.02	0.04	0	0	0.0000	Ser
		UAA	0.02	0.04	0	0	0.0000	Ter
		CUU	0.02	0.04	0	0	0.0000	Leu
		CCU	0.02	0.04	0	0	0.0000	Pro
		CAU	0.02	0.04	0	0	0.0000	His
		CUC	0.02	0.04	0	0	0.0000	Leu
		CCC	0.02	0.04	0	0	0.0000	Pro
		CAC	0.02	0.04	0	0	0.0000	His
		CUA	0.02	0.04	0	0	0.0000	Leu
		CCA	0.02	0.04	0	0	0.0000	Pro
		CAA	0.02	0.04	0	0	0.0000	Gln
		AUU	0.02	0.04	0	0	0.0000	Ile
		ACU	0.02	0.04	0	0	0.0000	Thr
		AAU	0.02	0.04	0	0	0.0000	Asn
		AUC	0.02	0.04	0	0	0.0000	Ile
		ACC	0.02	0.04	0	0	0.0000	Thr
		AAC	0.02	0.04	0	0	0.0000	Asn
		AUA	0.02	0.04	0	0	0.0000	Ile
		ACA	0.02	0.04	0	0	0.0000	Thr
		AAA	0.02	0.04	0	0	0.0000	Lys

Abbreviations

γ: General probability of a codon (γ = 1/64 ≅ 0.02)

θ: Probability of codon within a group (θ = {0.04, 0.11, 0.33, 1.00})

δ: Number of distinct codon sets that could complete 'GGGG' (δ = {0, 1, 2, 6})

Ω: Number of adjacent positions that contain δ (Ω = {0, 1, 2})

α: Threshold for selecting codons that may favour G-quadruplex formation

x: Codon specific generic ribonucleotide {A, G, U, C}

aa: Amino acid

Ter: Stop codons {UAG, UGA, UAA}
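The cardinalities in the "Codon set" column of Table 1 can be reproduced by enumerating all 64 codons and masking every non-Guanine position with x. In the sketch below the mask reserves x for the non-Guanine bases (A, U, C), which is the reading that yields the listed set sizes; the names `codon_set` and `PERMISSIVE_SETS` are hypothetical.

```python
from collections import Counter
from itertools import product


def codon_set(codon: str) -> str:
    """Mask non-Guanine positions with 'x', e.g. 'GUG' -> 'GxG'."""
    return "".join("G" if base == "G" else "x" for base in codon)


# Codon sets with alpha > 0 in Table 1 (Ranks 1-4).
PERMISSIVE_SETS = {"GGG", "GxG", "xGG", "GGx", "xxG", "Gxx"}

counts = Counter(codon_set("".join(c)) for c in product("ACGU", repeat=3))
permissive_total = sum(counts[s] for s in PERMISSIVE_SETS)

if __name__ == "__main__":
    for name, size in sorted(counts.items(), key=lambda item: item[1]):
        print(name, size)
    print("permissive:", permissive_total, "non-permissive:", 64 - permissive_total)
```

The totals of 28 permissive and 36 non-permissive codons match the counts quoted in the caption of Figure 3.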

3.2. Validation studies of PTG4 in known G4-forming exons to establish equivalence (TG4~PTG4)

An estimate of the possible combinations of the simplest peptide (GlyzGlyzGlyzGly; i = length(pTG4_ij) = 7 aa; z∈Z) gives $\sum_{i = 7} \sum_{j = 1}^{j = J} p T G 4_{i j}$ = 8.00E+03 (J = (20)³) (Figures 3 and 4, Table 2). This justifies the use of PTG4 (pTG4_ij∈PTG4) as a generic representation of the putative peptidome encoded by the TG4. Approximately 12% (n = 11) of the in silico translated amino acid sequences from exon-derived TG4 possess one or more "STOP" signals; these include ESR1, the longer RNA variant of PRNP (85 nt) and BCL2 (29 nt, 33 nt, 34 nt) (Table 3; Table 1, Supplementary Text 2). With the exceptions of KCNH2/ZNF669 and the shorter variants of PRNP (14 nt, 15 nt, 20 nt, 24 nt), "VALID" sub sequences are found for BACE1, BCL2, ESR1, PRNP (long) and TERF2 (Table 3; Tables 1A and 1C). Interestingly, all the genes considered possessed at least one occurrence of PTG4 (P = 100%, n = 6) (Table 3; Table 1B). This finding, despite the small sample size, is proof-of-principle that the TG4 can be mapped to definite peptide sequences, i.e., TG4~PTG4. Since this can occur only after a ribosomal read-through of the G4-containing mRNA, it raises the intriguing possibility that PTG4, when part of a larger protein, may increase its propensity to undergo misfolding. This notion is investigated in non-redundant sequences of IDRs (PTG4~10%, n = 145; 0.00≤p-value≤0.20) and IDPs (PTG4~34%, n = 269; 0.00≤p-value < 0.5) (Table 4; Tables 2 and 3).
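The count of 8.00E+03 for the simplest peptide can be confirmed by direct enumeration. The sketch below fixes Glycine at the four cluster positions and lets each of the three loop positions range over the 20 standard amino acids, which is the unconstrained loop set Z; the names `AMINO_ACIDS` and `simplest_peptides` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (set Z)


def simplest_peptides():
    """Yield every 7-residue peptide of the form G z G z G z G."""
    for z1, z2, z3 in product(AMINO_ACIDS, repeat=3):
        yield f"G{z1}G{z2}G{z3}G"


if __name__ == "__main__":
    peptides = list(simplest_peptides())
    print(len(peptides))  # prints: 8000
```

Longer members of the peptidome grow combinatorially in the same way, which is why a generator rather than a stored list is the more practical form for i > 7.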

Table 2. Codon-based classification of amino acids.

	aa	COD_amino	$\boldsymbol{gCOD}_{\boldsymbol {amino }}^{+}$	β
Group 1 (n=7)	Ala	4	4	1.00
	Val	4	4	1.00
	Asp	2	2	1.00
	Glu	2	2	1.00
	Trp	1	1	1.00
	Met	1	1	1.00
	Gly	4	4	1.00
Group 2 (n=7)	Leu	6	2	0.3333
	Gln	2	1	0.5
	Arg	6	2	0.3333
	Lys	2	1	0.5
	Ser	6	1	0.1667
	Thr	4	1	0.25
	Pro	4	1	0.25
Group 3 (n=6)	Cys	2	0	0.00
	Asn	2	0	0.00
	Ile	3	0	0.00
	His	2	0	0.00
	Phe	2	0	0.00
	Tyr	2	0	0.00

Abbreviations

$\boldsymbol{gCOD}_{\boldsymbol {amino }}^{+}$ : Guanine-containing optimal codons excluding STOP (UAG) (𝛼 > 0.000)

$\boldsymbol{COD}_{\boldsymbol {amino }}^{-}$ : Non-optimal codon excluding STOP (UGA, UAA) (𝛼 = 0.000)

𝑪𝑶𝑫_{𝒂𝒎𝒊𝒏𝒐} = $\boldsymbol{gCOD}_{\boldsymbol {amino }}^{+}$ + $\boldsymbol{COD}_{\boldsymbol {amino }}^{-}$ : All codons for an amino acid
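The β column of Table 2 is numerically consistent with the ratio of guanine-containing optimal codons to all codons for an amino acid. The sketch below recomputes β and the three groups from the tabulated counts; the ratio definition and the grouping rule (β = 1, 0 < β < 1, β = 0) are inferred from the table values rather than stated in the text.

```python
# (COD_amino, gCOD+_amino) per amino acid, copied from Table 2.
CODON_COUNTS = {
    "Ala": (4, 4), "Val": (4, 4), "Asp": (2, 2), "Glu": (2, 2),
    "Trp": (1, 1), "Met": (1, 1), "Gly": (4, 4),
    "Leu": (6, 2), "Gln": (2, 1), "Arg": (6, 2), "Lys": (2, 1),
    "Ser": (6, 1), "Thr": (4, 1), "Pro": (4, 1),
    "Cys": (2, 0), "Asn": (2, 0), "Ile": (3, 0),
    "His": (2, 0), "Phe": (2, 0), "Tyr": (2, 0),
}

def beta(aa):
    """Fraction of an amino acid's codons that are G-containing optimal codons."""
    total, g_optimal = CODON_COUNTS[aa]
    return g_optimal / total

def group(aa):
    """Assumed grouping rule: 1 if beta == 1, 3 if beta == 0, otherwise 2."""
    b = beta(aa)
    if b == 1.0:
        return 1
    return 3 if b == 0.0 else 2

groups = {g: [aa for aa in CODON_COUNTS if group(aa) == g] for g in (1, 2, 3)}
print({g: len(members) for g, members in groups.items()})
```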

Table 3. Genes (Homo sapiens) with G-quadruplex forming mRNA segments derived from one or more exons (Ex).

GENE	NAME	G4 (nt)	Ex	STOP (n=11)	VALID (n=59)	|PTG4|
BACE1	Beta-secretase 1	33	3	n=0	n=6	n=2
BCL2	B-cell lymphoma 2	33	2	n=1	n=6	n=1
		23		n=0	n=6
		28		n=0	n=6
		29		n=1	n=5
		34		n=1	n=5
		33	3	n=2	n=5
ESR1	Estrogen receptor alpha (ERα)	36	4	n=1	n=5	n=2
KCNH2	Potassium Voltage-Gated Channel Subfamily H Member 2	18	12	n=0	n=0	NA
ZNF669	Zinc Finger Protein 669		1
PRNP	Prion protein	14	2	n=0	n=0	n=1
		15		n=0	n=0
		20		n=0	n=0
		24		n=0	n=6
		85		n=6	n=3
TERF2	Telomeric repeat-binding factor 2	55	1	n=0	n=6	n=1

Table 4. Co-occurrence data for PTG4 and known SLiMS.

	Disordered regions (IDRs; n=1445;0.00≤p-value < 0.05)
	SL⁻PT⁻	SL⁻PT⁺	SL⁺PT⁻	SL⁺PT⁺	R₁T	R₂T	C₁T	C₂T	A (%)	P (%)	R (%)
SLiMS₁	1078	64	58	9	1142	67	1136	73	89.90	12.32	13.43
SLiMS₂	749	18	121	29	767	150	870	47	84.84	61.70	19.33
SLiMS₃	1212	108	34	9	1320	43	1246	117	89.58	7.69	20.93
	Proteins with disordered segments (IDPs; n=800;0.00≤p-value < 0.05)
	SL⁻PT⁻	SL⁻PT⁺	SL⁺PT⁻	SL⁺PT⁺	R₁T	R₂T	C₁T	C₂T	A (%)	P (%)	R (%)
SLiMS₁	86	12	28	18	98	46	114	30	72.22	60.00	39.10
SLiMS₂	1	1	96	66	2	162	97	67	40.85	98.50	40.74
SLiMS₃	250	57	26	25	307	51	276	82	76.81	30.48	49.01

Abbreviations

𝐼𝐷𝑅𝑠: Intrinsically disordered regions

𝐼𝐷𝑃s: Intrinsically disordered proteins

𝑧: Any amino acid

𝑆𝐿𝑖𝑀𝑆₁: [𝑆𝑇]𝑃𝑧𝑅

𝑆𝐿𝑖𝑀𝑆₂: [𝐸𝐷]𝑧𝑧[𝐷𝐸][𝐴𝐺𝑆]

𝑆𝐿𝑖𝑀𝑆₃: [𝐾𝑅]𝑧𝑃𝑧𝑧𝑃

𝑆𝐿⁻𝑃𝑇⁻: |𝑺𝑵𝑬𝑮 ∩ 𝑷𝑵𝑬𝑮|

𝑆𝐿⁻𝑃𝑇⁺: |𝑺𝑵𝑬𝑮 ∩ 𝑷𝑷𝑶𝑺|

𝑆𝐿⁺𝑃𝑇⁻: |𝑺𝑷𝑶𝑺 ∩ 𝑷𝑵𝑬𝑮|

𝑆𝐿⁺𝑃𝑇⁺: |𝑺𝑷𝑶𝑺 ∩ 𝑷𝑷𝑶𝑺|

𝑅₁𝑇: 𝑆𝐿⁻𝑃𝑇⁻ + 𝑆𝐿⁻𝑃𝑇⁺

𝑅₂𝑇: 𝑆𝐿⁺𝑃𝑇⁻ + 𝑆𝐿⁺𝑃𝑇⁺

𝐶₁𝑇: 𝑆𝐿⁻𝑃𝑇⁻+ 𝑆𝐿⁺𝑃𝑇⁻

𝐶₂𝑇: 𝑆𝐿⁻𝑃𝑇⁺ + 𝑆𝐿⁺𝑃𝑇⁺

𝐴: Accuracy

𝑃: Precision

𝑅: Recall
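The marginal totals and the accuracy, precision and recall columns of Table 4 follow from the four cell counts. The sketch below assumes the standard definitions, with SL⁺ taken as the reference class and PT⁺ as the prediction, which reproduces the tabulated values for the IDR rows; the function and key names are illustrative.

```python
def cooccurrence_metrics(sl_neg_pt_neg, sl_neg_pt_pos, sl_pos_pt_neg, sl_pos_pt_pos):
    """Return marginal totals and A, P, R (in %) for one 2x2 co-occurrence table."""
    r1t = sl_neg_pt_neg + sl_neg_pt_pos   # row total, SLiMS absent
    r2t = sl_pos_pt_neg + sl_pos_pt_pos   # row total, SLiMS present
    c1t = sl_neg_pt_neg + sl_pos_pt_neg   # column total, PTG4 absent
    c2t = sl_neg_pt_pos + sl_pos_pt_pos   # column total, PTG4 present
    total = r1t + r2t
    return {
        "R1T": r1t, "R2T": r2t, "C1T": c1t, "C2T": c2t,
        # Accuracy: agreement on both absent and both present.
        "A": 100 * (sl_neg_pt_neg + sl_pos_pt_pos) / total,
        # Precision: fraction of PTG4-positive sequences that also carry the SLiM.
        "P": 100 * sl_pos_pt_pos / c2t,
        # Recall: fraction of SLiM-positive sequences that also carry PTG4.
        "R": 100 * sl_pos_pt_pos / r2t,
    }

# SLiMS2 row for the disordered regions (IDRs) in Table 4.
m = cooccurrence_metrics(749, 18, 121, 29)
print({k: round(v, 2) for k, v in m.items()})
```

For the IDP rows the same formulas agree with the table to within the last reported digit, which suggests the published values were truncated rather than rounded.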

3.3. The peptidome of the translatable G-quadruplex may trigger misfolding of the encompassing protein

The amino acids that comprise the peptide members of PTG4 and the short linear motifs (g1, g2 vs SLiMS) are well conserved. The co-occurrence of PTG4 with SLiMS in the IDRs (A~85-89%; 0.00 < p-value≤0.05) suggests that this association is non-trivial and may favor all purported mechanisms of misfolding (hyperphosphorylation, proteolytic cleavage, complex formation) (Table 4; Tables 2 and 3). However, the higher precision of PTG4 with the proteolytic SLiMS suggests that this mechanism may predominate (Table 4; Tables 2 and 3). The data for the IDPs suggest a similar predilection for proteolytic cleavage (A~40-77%; P~99%; 0.00 < p-value < 0.05), although hyperphosphorylation (P~60%; 0.00 < p-value < 0.05) and complex promotion (P~30%; 0.00 < p-value < 0.05) may constitute viable alternatives in the genesis of misfolding (Table 4; Tables 2 and 3). The presence of overlapping sequences of amino acids between PTG4 and the SLiMS, when examined in protein sequences from taxonomically diverse organisms, is degenerate for SLiMS₁ (number of matches = 6251) and SLiMS₃ (number of matches = 1480) (Table 5; Table 4). In contrast, the corresponding data for SLiMS₂ (number of matches = 3759; 0.00 < p-value < 0.05) is statistically significant (Table 5). The taxonomic spread includes archaea (n = 150), bacteria (n = 1735), viruses (n = 84), green land plants (n = 199), fungi (n = 182), eukaryotic invertebrates (n = 43) and vertebrates (n = 700) (Table 4).
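The co-occurrence classes used above (SL⁻PT⁻, SL⁻PT⁺, SL⁺PT⁻, SL⁺PT⁺) can be illustrated with a pattern scan. This is a minimal sketch, assuming that z maps to the regular-expression wildcard `.` and that PTG4 is represented only by its simplest member GzGzGzG; the full PTG4 definition is broader, and the test sequence is hypothetical.

```python
import re

# SLiMS patterns from the Table 4 abbreviations, with z -> "." (any amino acid).
SLIMS = {
    "SLiMS1": re.compile(r"[ST]P.R"),
    "SLiMS2": re.compile(r"[ED]..[DE][AGS]"),
    "SLiMS3": re.compile(r"[KR].P..P"),
}

# Assumption: only the simplest PTG4 member (GlyzGlyzGlyzGly) is scanned for.
PTG4_SIMPLEST = re.compile(r"G.G.G.G")

def classify(sequence):
    """Map each SLiM name to (SLiM present, PTG4 present) for one sequence."""
    has_ptg4 = bool(PTG4_SIMPLEST.search(sequence))
    return {name: (bool(pattern.search(sequence)), has_ptg4)
            for name, pattern in SLIMS.items()}

# Hypothetical sequence containing SPAR (SLiMS1) and GAGVGEG (PTG4).
result = classify("MASPARGAGVGEGKK")
print(result)
```

Tallying these pairs over a set of sequences yields the four cell counts of a contingency table of the kind reported in Table 4.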

Table 5. Occurrence of overlapping amino acids of PTG4 and known SLiMS in curated full length protein sequences.

SLiMS	Sample	(z_n)_{n≥2} ∈ pTG4_ij ∩ SLiMS_w (p-value)
SLiMS₁=[ST]PzR	Pz	PG (n=1)	PG (Degenerate)
SLiMS₂=[ED]zzD[AGS]	z[DE]	G[DE] (n=2)	[WGRVAELMKQSTP][AG][DE]z2EG[VADE](p-value=0.00069)
	[DE]z	[DE]G (n=1)
	zz[DE]	[LMKQSTP]G[DE] (n=14) [VAE]G[DE] (n=6) [WGR][AG][DE] (n=6)
	[DE]zz[DE]	[VAE]G[DE]zzEG[VADE] (n=28) [WGR][AG][DE]zzEG[VADE] (n=24) [WGR][AG][DE]zzEG (n=6) GEzzEG[VADE] (n=4) GEzzEG (n=1)
SLiMS₃=[KR]zPzzP	Pzz	PG[VADE] (n=4)	PGV (Degenerate) PGA (Degenerate) PGD (Degenerate) PGE (Degenerate)

Abbreviations

𝑝𝑇𝐺4_𝑖𝑗: Members of putative peptidome (𝑝𝑇𝐺4_𝑖𝑗 ∈ 𝑷𝑻𝑮𝟒)

𝑆𝐿𝑖𝑀𝑆_𝑤: Short linear motifs (𝑆𝐿𝑖𝑀𝑆_𝑤 ∈ 𝑺𝑳𝒊𝑴𝑺)

𝑧_𝑛: Shared sequence(s) of amino acids between 𝑷𝑻𝑮𝟒 and 𝑺𝑳𝒊𝑴𝑺

𝑖, 𝑗, 𝑤, 𝑛: Indices to characterize members of 𝑷𝑻𝑮𝟒, 𝑺𝑳𝒊𝑴𝑺, 𝒁

4. Discussion

The significant association and homology between PTG4 and the SLiMS along with the equivalence data (PTG4~TG4) suggest that TG4 may influence proteostasis in a multitude of ways (Tables 1–5; Tables 1–4, Supplementary Text 2–4).

4.1. TG4 may affect the stability of mRNA and indirectly influence proteostasis

The short TG4 modeled in this study has an average loop length (h~2 Mer), which may contribute to thermodynamic stability by restricting the mobility of the participating strands (1) ^[8,9,10,11]. The physical presence of TG4 will stall the ribosome, so that translation is prolonged, inefficient and incomplete ^[31,32,33]. Interestingly, this analysis also includes UAG (Amber; α > 0.0000), which, when present in-frame, will prematurely terminate translation and result in a truncated protein (Table 1) ^[39]. Whilst nonsense-mediated mRNA decay may be triggered if the stop codon is within ±50 Mer of the exon-junction complex (EJC), a read-through may occur nonetheless. The resulting protein sequences may be modified, which, in tandem with one or more occurrences of PTG4 and/or SLiMS, would predispose them to aggregate and result in a proteopathy ^[39,40].

4.2. Mechanism(s) of PTG4-mediated misfolding

Whilst the preponderance of Glycine (Gly) might impart heightened flexibility and limit the formation of stabilizing secondary structural elements in the hypothetical protein, Proline (Pro) confers rigidity and may retard proper folding. There is also remarkable conservation between the amino acids that comprise PTG4 and the SLiMS. These include the complex-promoting hydrophobic (Ala, Val, Met, Trp) and ionic (Asp, Glu, Lys, Arg) residues, along with the nucleophile-favoring Serine and Threonine (Figures 3 and 4, Tables 2–5). Whilst the former may favor aggregation by non-covalent interactions, the latter may promote phosphorylation-mediated charge imbalance and thence misfolding. Interestingly, the loops of G4, when modeled by Adenine-containing codons (Axx), are translated to Lysine (K), Arginine (R), Serine (S), Threonine (T) and Isoleucine (I), all of which may also promote misfolding (Figures 3 and 4, Tables 2–5) ^[8,9,10,11,34,35]. The distribution of PTG4 amongst physiologically relevant proteins further suggests that peptide-mediated misfolding may influence or regulate signal transduction, cytoskeleton organization, metabolism, synaptic transmission and transcription/translation (Table 6; Table 5).

Table 6. Proteins encompassing PTG4 as candidates for motif mimicry ^*.

	Cellular function	Disordered regions of proteins
1.	Signal transduction	DP00274, DP00224, DP00141, DP00332, DP01063, DP00506, DP00418, DP00341, DP00435, DP00613, DP00463, DP00954, DP00959, DP01104, DP00611, DP00519, DP00086, DP00707, DP00712
2.	Endocytosis	DP01073, DP01065, DP01066, DP00225
3.	Calcium-calmodulin	DP00092, DP00132, DP00561, DP00118, DP00253
4.	Myofibril assembly	DP01090
5.	Cytoskeleton	DP01056, DP00240, DP01022, DP00169, DP00716, DP00717, DP01100, DP00122
6.	Nuclear pore	DP01075, DP01077, DP01079
7.	Phototransduction	DP00768, DP00347
8.	Targeting	DP00893, DP00609, DP00610, DP01058
9.	Transcription	DP00062, DP00177, DP00633, DP00348, DP00786, DP00049, DP00231, DP00873, DP00720, DP00217, DP00081
10.	Translation	DP00082, DP00164, DP00229, DP00949, DP00134
11.	Synaptic transmission	DP00943
12.	Supercoiling	DP00076
13.	Binding	DP00539, DP00854, DP01052, DP00659, DP00656
14.	Peptide bond formation	DP00944
15.	Enzymes	DP00557, DP00032, DP00095, DP00337, DP00379, DP00787, DP00427, DP00429
16.	Bacterial/parasitic virulence
	Secreted toxins	DP00345, DP00591
	Cytoadherence	DP00025, DP00065, DP01096
17.	Viral infectivity
	Cyclophilin interaction	DP00615, DP01031
	Chaperones	DP00699, DP00700, DP00674
	Capsid assembly	DP00133, DP00876
	Membrane fusion	DP01043
	Latency	DP01060
18.	Unknown	DP00119

Note: DP≔DisProt ID

4.3. Degeneracy of PTG4 with SLiMS in non-vertebrate taxa may favor development of secondary proteopathies

The distribution of overlapping/shared amino acids in protein sequences of non-vertebrates suggests that PTG4 is either completely degenerate with the SLiMS or present in proportions that are statistically significant (Tables 5 and 6; Tables 4 and 5). These data imply that motif mimicry, too, might constitute a probable cause (tropism, oncogenic potential, virulence) of infection/infestation-mediated acute/chronic proteopathies ^[34,35,41,42]. The contribution(s) of misfolding to the pathogenesis of secondary proteopathies is, however, debatable. Whilst there is evidence that mislocalization of proteins can precipitate misfolding, mimicry itself may result in exonuclease-mediated proteolytic cleavage and thence trigger an infective proteopathy ^[43,44]. Additionally, the presence of sequences of amino acids such as Proline and Threonine in viral or fungal proteins may be responsible for creating and/or maintaining a milieu conducive to the genesis of infective/transmissible proteopathies, viz., a high charge density and an imbalance of electrostatic interactions ^[43,44].

5. Conclusions

The coexistence of potentially translatable G-quadruplexes (TG4) with unfolded ribonucleotides in the PCS of an mRNA transcript may have important consequences for protein homeostasis. Here, I have investigated the contribution of a short intra-strand translatable G-quadruplex and its associated peptidome (TG4~PTG4) to the genesis of misfolding-induced proteostasis. The co-occurrence, homology and distribution of overlapping/shared amino acids of PTG4 with the SLiMS suggest that this may occur by truncation, complex formation, increased charge density and/or accelerated degradation. An additional mechanism that is also supported is motif mimicry by pathogens, which may trigger the development of infective proteopathies. The putative peptidome (~7–20 aa) that corresponds to the short translatable G-quadruplex delineated by this investigation may be utilized as a set of novel markers of both the primary and secondary proteopathies.

Author's contribution

SK outlined and designed the study, designed and conceptualized the algorithm(s) and formulae for prediction, wrote mathematical proofs to establish rigor, collated the data, constructed the models, formulated the filters, carried out the computational analysis, wrote all necessary code and the manuscript.

Conflict of interest

The author declares no conflict of interest.

References

[1]	Kumar V, Singh R, Quraishi MA (2013) A study on corrosion of reinforcement in concrete and effect of inhibitor on service life of RCC. J Mater Environ Sci 4: 726–731.
[2]	Gu XL, Zhang WP, Shang DF, et al. (2010) Flexural Behavior of corroded Reinforced Concrete Beams, In: Song GB, Malla RB, Earth and Space 2010: Engineering, Science, Construction and Operations in Challenging Environments, 3545–3552.
[3]	Torres-Acosta AA, Navarro-Gutierrez S, Terán-Guillén J (2007) Residual flexure capacity of corroded reinforced concrete beams. Eng Struct 29: 1145–1152. doi: 10.1016/j.engstruct.2006.07.018
[4]	Mangat PS, Elgarf MS (1999) Flexural strength of concrete beams with corroding reinforcement. ACI Struct J 96: 149–158.
[5]	Al-Sulimani GJ, Kaleemullah M, Basunbul IA, et al. (1990) Influence of corrosion and cracking on bond behavior and strength of reinforced concrete members. ACI Struct J 87: 220–231.
[6]	Yuan YS, Jia FP, Cai Y (2004) The structural behavior deterioration model for corroded reinforced concrete beams. China Civil Eng J 34: 47–52.
[7]	Zhang WP, Shang DF, Gu XL (2006) Stress-strain relationship of corroded steel bars. J Tongji Univ-Nat Sci 34: 586–592.
[8]	Dhawan S, Bhalla S, Bhattacharjee B (2014) Reinforcement corrosion in concrete structures and service life predictions-A Review. 9th International Symposium on Advanced Science and Technology in Experimental Mechanics, New Delhi, India.
[9]	Ahmad S (2003) Reinforcement corrosion in concrete structures, its monitoring and service life prediction-a review. Cement Concrete Comp 25: 459–471. doi: 10.1016/S0958-9465(02)00086-0
[10]	Wang XH, Liu XL (2010) Simplified methodology for the evaluation of the Residual strength of corroded reinforced concrete beams. J Perform Constr Fac 24: 108–119. doi: 10.1061/(ASCE)CF.1943-5509.0000083
[11]	Wang XH, Liu XL (2006) Bond strength modelling for corroded reinforcement. Constr Build Mater 20: 177–186. doi: 10.1016/j.conbuildmat.2005.01.015
[12]	Torres-Acosta AA, Navarro-Gutierrez S, Terán-Guillén J (2006) Residual Flexure capacity of corroded reinforced concrete beams. Eng Struct 29: 1145–1152.
[13]	Xia J, Jin WL, Li LY (2012) Effect of chloride-induced reinforcing steel corrosion on the flexural strength of reinforced concrete beams. Mag Concrete Res 64: 471–485. doi: 10.1680/macr.10.00169
[14]	Wang W, Chen J (2011) Residual Strengths of Reinforced Concrete Beams with Heavy Deterioration. Res J Appl Sci Eng Technol 2: 798–805.
[15]	Tachibana Y, Maeda KI, Kajikawa Y, et al. (1990) Mechanical behaviour of RC beams damaged by corrosion of reinforcement. Proceedings of the 3rd International Symposium on Corrosion of Reinforcement in Concrete Construction, Wishaw, 178–187.
[16]	Imam A, Azad AK (2016) Prediction of residual shear strength of corroded reinforced concrete beams. Int J Adv Struct Eng 8: 307–318.
[17]	Cairns J, Du Y, Law D (2008) Structural performance of corrosion-damaged Concrete beams. Mag Concrete Res 60: 359–370. doi: 10.1680/macr.2007.00102
[18]	Ahmad S (2017) Prediction of residual flexural strength of corroded reinforced concrete beams. Anti-Corros Method M 64: 69–74. doi: 10.1108/ACMM-11-2015-1599
[19]	Ortega NF, Robles SI (2016) Assessment of residual life of concrete structures affected by reinforcement corrosion. HBRC J 12: 114–122. doi: 10.1016/j.hbrcj.2014.11.003
[20]	Shannag MJ, Al-Ateek SA (2006) Flexural behaviour of strengthened concrete Beams with corroding reinforcement. Constr Build Mater 20: 834–840. doi: 10.1016/j.conbuildmat.2005.01.059
[21]	Hawileh RA, Abdalla JA, Al Tamimi A, et al. (2011) Behaviour of corroded steel reinforcing bars under monotonic and cyclic loadings. Mech Adv Mater Struc 18: 218–224. doi: 10.1080/15376494.2010.499023
[22]	Jin WL, Zhao YX (2001) Effect of corrosion on bond behaviour and bending strength of reinforced concrete beams. J Zhejiang Univ-Sc A 2: 298–308. doi: 10.1631/jzus.2001.0298
[23]	Azad AK, Ahmad S, Azher SA (2007) Residual strength of corrosion damaged reinforced concrete beams. ACI Mater J 104: 40–47.
[24]	Wang XH, Liu XL (2008) Modelling the flexural carrying capacity of corroded RC beams. J Shanghai Jiaotong Univ Sci 13: 129–135. doi: 10.1007/s12204-008-0129-1
[25]	Ahmad S (2014) An experimental study on correlation between concrete resistivity and reinforcement corrosion rate. Anti-Corros Method M 61: 158–165. doi: 10.1108/ACMM-07-2013-1285
[26]	Torres-Acosta AA, Martínez-Madrid M (2003) Residual life of corroding reinforced concrete structures in marine environment. J Mater Civil Eng 15: 344–353. doi: 10.1061/(ASCE)0899-1561(2003)15:4(344)
[27]	Imam A, Anifowose F, Azad AK (2015) Residual strength of corroded reinforced concrete beams using an Adaptive Model based on ANN. Int J Concr Struct M 9: 159–172. doi: 10.1007/s40069-015-0097-4
[28]	Abdalla JA, Elsanosi A, Abdelwahab A (2007) Modelling and simulation of shear resistance of R/C beams using artificial neural network. J Franklin I 344: 741–756. doi: 10.1016/j.jfranklin.2005.12.005
[29]	Wu X, Ghaboussi J, Garrett JH (1992) Use of neural networks in detection of structural damage. Comput Struct 42: 649–659. doi: 10.1016/0045-7949(92)90132-J
[30]	Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE T Evolut Comput 1: 67–82. doi: 10.1109/4235.585893
[31]	Rafiq MY, Bugmann G, Easterbrook DJ (2001) Neural network design for engineering applications. Comput Struct 79: 1541–1552. doi: 10.1016/S0045-7949(01)00039-6
[32]	Bai J, Wild S, Ware JA, et al. (2003) Using neural networks to predict workability of concrete incorporating metakaolin and fly ash. Adv Eng Softw 34: 663–669. doi: 10.1016/S0965-9978(03)00102-9
[33]	Oreta AWC, Kawashima K (2003) Neural network modelling of confined compressive strength and strain of circular concrete columns. J Struct Eng 129: 554–561. doi: 10.1061/(ASCE)0733-9445(2003)129:4(554)
[34]	Erdem H (2010) Prediction of the moment capacity of reinforced concrete slabs in fire using artificial neural networks. Adv Eng Softw 41: 270–276. doi: 10.1016/j.advengsoft.2009.07.006
[35]	Tsai CH, Hsu DS (2002) Diagnosis of reinforced concrete structural damage base on displacement time history using the back-propagation neural network technique. J Comput Civil Eng 16: 49–58. doi: 10.1061/(ASCE)0887-3801(2002)16:1(49)
[36]	Cabrera JG (1996) Deterioration of concrete due to reinforcement steel corrosion. Cement Concrete Comp 18: 47–59. doi: 10.1016/0958-9465(95)00043-7
[37]	Inel M (2007) Modelling ultimate deformation capacity of RC columns using artificial neural networks. Eng Struct 29: 329–335. doi: 10.1016/j.engstruct.2006.05.001
[38]	Adhikary BB, Mutsuyoshi H (2006) Prediction of shear strength of steel fiber RC beams using neural networks. Constr Build Mater 20: 801–811. doi: 10.1016/j.conbuildmat.2005.01.047
[39]	Sharifi Y, Tohidi S (2014) Ultimate capacity assessment of web plate beams with pitting corrosion subjected to patch loading by Artificial Neural Networks. Adv Steel Constr 10: 325–350.
[40]	Abdalla JA, Hawileh R (2011) Modeling and simulation of low-cycle fatigue life of steel reinforcing bars using artificial neural network. J Franklin I 348: 1393–1403. doi: 10.1016/j.jfranklin.2010.04.005
[41]	Naser M, Abu-Lebdeh G, Hawileh R (2012) Analysis of RC T-Beams strengthened with CFRP plates under fire loading using ANN. Constr Build Mater 37: 301–309. doi: 10.1016/j.conbuildmat.2012.07.001
[42]	Abdalla JA, Hawileh RA (2013) Artificial Neural Network predictions of fatigue life of steel bars based on hysteretic energy. J Comput Civil Eng 27: 489–496. doi: 10.1061/(ASCE)CP.1943-5487.0000185
[43]	Abdalla JA, Saqan EI, Hawileh RA (2014) Optimum seismic design of unbonded post-tensioned precast concrete walls using ANN. Comput Concrete 13: 547–567. doi: 10.12989/cac.2014.13.4.547
[44]	Alqedra MA, Ashour AF (2005) Prediction of shear capacity of single anchors located near a concrete edge using neural networks. Comput Struct 83: 2495–2502. doi: 10.1016/j.compstruc.2005.03.019
[45]	Sakthivel PB, Ravichandran A, Alagumurthi N (2016) Modelling and prediction of flexural strength of hybrid mesh and fiber reinforced cement-based composites using Artificial Neural Network (ANN). Int J GEOMATE Geotech Const Mat Env 10: 1623–1635.
[46]	Kumar EP, Sharma EP (2014) Artificial Neural Networks-A Study. Int J Emerg Eng Res Technol 2: 143–148.
[47]	Flood I, Kartam N (1994) Neural networks in civil engineering. II: Systems and application. J Comput Civil Eng 8: 149–162. doi: 10.1061/(ASCE)0887-3801(1994)8:2(149)
[48]	Flood I, Christophilos P (1996) Modelling construction processes using artificial neural networks. Automat Constr 4: 307–320. doi: 10.1016/0926-5805(95)00011-9
[49]	Jeng DS, Cha DH, Blumenstein M (2003) Application of Neural Networks in Civil Engineering problems. Proceedings of the International Conference on Advances in the Internet, Processing, Systems and Interdisciplinary Research.
[50]	Baughman DR (1995) Neural networks in bioprocessing and chemical engineering [PhD Dissertation]. Virginia Tech, Blacksburg, VA.
[51]	Güler İ, Übeylı ED (2005) ECG beat classifier designed by combined neural network model. Pattern Recogn 38: 199–208. doi: 10.1016/j.patcog.2004.06.009
[52]	Li L, Jiao L (2002) Prediction of the oilfield output under the effects of nonlinear factors by artificial neural network. J Xi'an Petrol Inst 17: 42–44.
[53]	Moghadassi AR, Parvizian F, Hosseini SM, et al. (2009) A new approach for estimation of PVT properties of pure gases based on artificial neural network model. Braz J Chem Eng 26: 199–206. doi: 10.1590/S0104-66322009000100019
[54]	Mohaghegh S (1995) Neural network: What it can do for petroleum engineers. J Petrol Technol 47: 42–42. doi: 10.2118/29219-PA
[55]	Chen HM, Tsai KH, Qi GZ, et al. (1995) Neural network for structure control. J Comput Civil Eng 9: 168–176. doi: 10.1061/(ASCE)0887-3801(1995)9:2(168)
[56]	Hasançebi OU, Dumlupınar T (2013) Linear and nonlinear model updating of reinforced concrete T-beam bridges using artificial neural networks. Comput Struct 119: 1–11. doi: 10.1016/j.compstruc.2012.12.017
[57]	Azad AK, Ahmad S, Al-Gohi BHA (2010) Flexural strength of corroded reinforced concrete beam. Mag Concrete Res 62: 405–414. doi: 10.1680/macr.2010.62.6.405
[58]	Beale MH, Hagan MT, Demuth HB (2013) Neural Network Toolbox™ user's guide, Natick, MA: The Mathworks Inc.
[59]	Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2 Eds., Berlin, Germany: Springer.

This article has been cited by:

C. Hameni Nkwayep, R. Glèlè Kakaï, S. Bowong, Prediction and control of cholera outbreak: Study case of Cameroon, 2024, 9, 24680427, 892, 10.1016/j.idm.2024.04.009

© 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)