
DNA modifications, such as methylation and demethylation, play important roles in the regulation of gene expression [1]. At CpG (5'-C-phosphate-G-3') sites, cytosine methylation is an important epigenetic trait that is closely related to cell proliferation and the protection of chromosomal stability [2,3]. 5-methylcytosine (5mC), 4-methylcytosine (4mC), and 3-methylcytosine are the most common cytosine methylations in eukaryotic and prokaryotic genomes [4,5]. 5mC is the most common form of cytosine methylation and relates to many cancers and neural diseases [6,7]. 4mC is also an effective modification that guards the host's genetic information from degradation by restriction enzymes [8,9,10]. Accurate recognition of 4mC sites could provide key clues for understanding their regulatory roles.

Currently, several experimental methodologies, including mass spectrometry, reduced-representation bisulfite sequencing, and single-molecule real-time sequencing, have been developed to identify 4mC sites [11,12,13]. Although these methodologies are helpful for identifying 4mC sites, they are highly expensive when applied to large sequencing datasets. Thus, a bioinformatics tool to identify 4mC sites is urgently needed.

At present, several computational methods have been presented to identify 4mC sites. In 2017, a prediction model based on a confirmed 4mC dataset was constructed to predict 4mC sites in several species [14]. Afterwards, an iterative feature representation algorithm was designed based on the benchmark dataset of Chen et al. [15], which learned and trained features from numerous progressive models to predict 4mC sites. iEC4mC-SVM [16] was developed to predict 4mC sites in Escherichia coli using light gradient boosting machine feature selection. DNA4mc-LIP [17], a linear integration tool, was developed by combining existing prediction methods to identify 4-methylcytosine sites in multiple species.
Then, Meta-4mCpred [18] was developed to predict 4mC sites in the genomes of six species. However, to date, only two predictors, i4mC-Mouse and 4mCpred-EL, are available for recognizing 4mC sites in mice [19,20]. These two methods applied various features and machine learning algorithms to mouse sequence data derived from the MethSMRT database [21]. Although both i4mC-Mouse and 4mCpred-EL produce good results, there is still room for improvement by extracting more feature information.
To address the aforementioned issues, an ensemble model was established to predict 4mC sites in mice. Figure 1 shows the workflow of the proposed model. First, three types of feature descriptors, k-mer composition, enhanced nucleic acid composition (ENAC) and composition of k-spaced nucleic acid pairs (CKSNAP), were fed into a random forest classifier [22] to identify 4mC sites. Next, mRMR [23] with the IFS technique [24,25] was used to obtain the optimal feature vectors. Finally, the best model was examined on an independent dataset, where the results indicated that the proposed model outperformed the two existing predictors, i4mC-Mouse and 4mCpred-EL.
A reliable and accurate dataset is necessary to establish a prediction model. Therefore, we obtained the benchmark dataset from the work of Hasan et al. [20] and Manavalan et al. [19]. In their studies, similar sequences were excluded using a 70% sequence identity cutoff [26]. After this elimination procedure, the benchmark dataset comprised 906 positive and 906 negative sequences, each 41 bp long. Subsequently, the benchmark data were separated into 80% training data and 20% independent data to objectively estimate the performance of predictors, as shown in Table 1.
| Attribute | Training Data | Independent Data | Total |
| --- | --- | --- | --- |
| Positive | 746 | 160 | 906 |
| Negative | 746 | 160 | 906 |
| Total | 1492 | 320 | 1812 |
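The 80/20 split in Table 1 could be reproduced, for example, with scikit-learn; the sketch below is illustrative only (the function name `split_benchmark`, the stratification, and the seed are assumptions, not the authors' exact procedure):

```python
from sklearn.model_selection import train_test_split

def split_benchmark(sequences, labels, seed=1):
    """Stratified 80/20 split into training and independent data (cf. Table 1)."""
    return train_test_split(sequences, labels, test_size=0.2,
                            stratify=labels, random_state=seed)
```

With 906 positive and 906 negative 41-bp sequences, this yields 1492 training and 320 independent samples, matching Table 1.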
Selecting informative and independent feature encodings is an important stage in creating machine learning-based models, such as BioSeq-Analysis2.0 [27], IDP-Seq2Seq [28], ACPred [29], iBitter-SCM [30], iTTCA-Hybrid [31], Meta-iAVP [32], PseKRAAC [33], iBLP [34] and so on [35,36]. Expressing DNA sequences as mathematical representations is very important for functional element identification. Zhang et al. obtained the optimal nonamer composition to represent mRNA sequences [37]. Dao et al. used three types of feature encodings: physicochemical properties, binary encoding and nucleotide chemical properties [38]. Yang et al. identified recombination sites based on k-mer composition [39]. Dou et al. used k-mer nucleotide composition, nucleotide chemical properties and pseudo dinucleotide composition to identify RNA modification sites [40]. Wei et al. identified circRNA-disease associations based on matrix factorization [41]. Zheng et al. developed reduced amino acid clusters [42]. Lv et al. applied k-tuple nucleotide frequency components, nucleotide pair spectrum encoding and natural vectors to the 3D genome [43]. Here, three feature-encoding approaches were used to describe the DNA sequences.
The k-mer nucleotide composition (NC) reflects short-range nucleotide interactions within sequences [44,45,46]. For a sequence of N bp, the (N−k+1) k-mers are obtained by sliding a window of k bp along the sequence with a step size of 1 bp. An arbitrary sample M of length N (here N = 41 bp) can be characterized as
$$M = R_1 R_2 R_3 \cdots R_i \cdots R_{N-1} R_N \tag{1}$$
where Ri signifies the nucleotide (A, T, C, and G) at the i-th position. The sequences can be transformed into the 4k-D vector using k-mer nucleotide composition as follows
$$M_k = \left[f_1^{k\text{-tuple}}, f_2^{k\text{-tuple}}, \ldots, f_i^{k\text{-tuple}}, \ldots, f_{4^k}^{k\text{-tuple}}\right]^T \tag{2}$$
where T denotes the transposition of the vector and $f_i^{k\text{-tuple}}$ denotes the frequency of the i-th k-mer in the sequence. When k = 1, a DNA sample is encoded as a 4-D vector $M_1 = [f(A), f(G), f(C), f(T)]^T$. When k = 2, the sample is described by a 16-D vector. In this study, k ranged over 1, 2, ..., 6, so a sequence sample was transformed into a 5460-dimensional ($4^1 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6$) vector formulated as follows
$$M = M_1 \cup M_2 \cup M_3 \cup M_4 \cup M_5 \cup M_6 \tag{3}$$
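As a concrete illustration of Equations (1)–(3), the following Python sketch computes the k-mer composition and the 5460-dimensional fusion vector. The function names and the lexicographic A/C/G/T ordering of k-mers are illustrative choices, not taken from the paper:

```python
from itertools import product

def kmer_composition(seq, k):
    """Frequency vector of all 4**k k-mers, in lexicographic ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):          # sliding window, step size 1
        counts[index[seq[i:i + k]]] += 1
    total = len(seq) - k + 1                   # N - k + 1 windows
    return [c / total for c in counts]

def kmer_fusion(seq, kmax=6):
    """Concatenate k-mer vectors for k = 1..kmax (4 + 16 + ... + 4096 = 5460 dims)."""
    vec = []
    for k in range(1, kmax + 1):
        vec.extend(kmer_composition(seq, k))
    return vec
```

For a 41-bp sample, `kmer_fusion` returns the 5460-dimensional vector of Equation (3).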
The ENAC encoding calculates the nucleic acid composition within a sliding window and can be used to formulate sequences of equal length. It is calculated as
$$Q = \left[\frac{N_{A,\mathrm{win}_1}}{k}, \frac{N_{G,\mathrm{win}_1}}{k}, \frac{N_{C,\mathrm{win}_1}}{k}, \frac{N_{T,\mathrm{win}_1}}{k}, \frac{N_{A,\mathrm{win}_2}}{k}, \ldots, \frac{N_{G,\mathrm{win}_{L-k+1}}}{k}, \frac{N_{T,\mathrm{win}_{L-k+1}}}{k}\right] \tag{4}$$
In Equation (4), k is the size of the sliding window, $N_{t,\mathrm{win}_p}$ denotes the count of nucleotide $t \in \{A, C, G, T\}$ in the p-th window ($p = 1, 2, \ldots, L-k+1$), and L is the sequence length. In this study, the sliding window size was set to 5, giving a feature dimension of $(41 - 5 + 1) \times 4 = 148$.
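Equation (4) amounts to concatenating per-window nucleotide fractions; a minimal sketch (the function name and A/C/G/T output order are illustrative assumptions):

```python
def enac(seq, window=5):
    """Enhanced nucleic acid composition: nucleotide fractions per sliding window."""
    feats = []
    for p in range(len(seq) - window + 1):     # L - k + 1 windows
        win = seq[p:p + window]
        for nt in "ACGT":
            feats.append(win.count(nt) / window)
    return feats
```

For a 41-bp sequence with window size 5, this yields the 148-dimensional ENAC vector.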
The CKSNAP encoding describes the frequency of nucleotide pairs separated by any k nucleotides (k = 0, 1, 2, 3, 4, 5). The feature comprises 16 nucleotide pairs (AA, AG, ..., TG, TT). Taking k = 1 as an example, the composition of k-spaced nucleic acid pairs can be specified as follows:
$$Q = \left[\frac{N_{A*A}}{N_{\mathrm{Total}}}, \frac{N_{A*G}}{N_{\mathrm{Total}}}, \ldots, \frac{N_{T*G}}{N_{\mathrm{Total}}}, \frac{N_{T*T}}{N_{\mathrm{Total}}}\right]_{16} \tag{5}$$
where * stands for any spacer nucleotide (A, G, C, or T), $N_{Y*Z}$ denotes the number of k-spaced Y–Z pairs in the sequence, and $N_{\mathrm{Total}}$ denotes the total number of k-spaced nucleotide pairs. If, for example, the pair AA appears j times in the sequence, the composition of AA equals j divided by the total number of 0-spaced pairs $N_{\mathrm{Total}}$. For k = 0, 1, 2, 3, 4 and 5, $N_{\mathrm{Total}}$ equals P−1, P−2, P−3, P−4, P−5 and P−6, respectively, for a sequence of length P. In this study, k was set to 2, so the CKSNAP feature had $16 \times 3 = 48$ dimensions.
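A minimal sketch of the CKSNAP computation described above (the function name and pair ordering are illustrative; for each spacing k, a pair consists of positions i and i+k+1, giving P−k−1 pairs in a sequence of length P):

```python
from itertools import product

def cksnap(seq, kmax=2):
    """Composition of k-spaced nucleotide pairs for k = 0..kmax (16*(kmax+1) dims)."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    feats = []
    for k in range(kmax + 1):
        total = len(seq) - k - 1               # N_Total = P - k - 1 pairs
        counts = {p: 0 for p in pairs}
        for i in range(total):
            counts[seq[i] + seq[i + k + 1]] += 1
        feats.extend(counts[p] / total for p in pairs)
    return feats
```

For a 41-bp sequence with kmax = 2, this gives the 48-dimensional CKSNAP vector used in the study.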
The inclusion of noisy features may degrade a model's performance. Dao et al. proposed a two-step feature selection strategy to exclude noise [47]. Feng et al. used the mRMR technique to reduce noise [48]. Shao et al. applied three ranking algorithms to exclude irrelevant features [49]. Cheng et al. used MetaMap to reduce noisy features [50]. Other computational works have done similar things [51,52,53]. Therefore, feature selection is an obligatory step to remove less important features and increase the efficiency of a model [54]. Many feature selection and ranking techniques are available, such as F-score, mRMR [23], MRMD [55] and chi-square [56]. In this study, mRMR with IFS [24,57] was applied to obtain the optimal feature subset. mRMR is a filter-based selection technique [58] for achieving an optimal model. Let y and z be two features with marginal probability densities P(y) and P(z) and joint probability density P(y, z); the mutual information between them is defined as
$$I(y;z) = \iint P(y,z)\,\log\frac{P(y,z)}{P(y)\,P(z)}\,dy\,dz \tag{6}$$
Based on mutual information, the maximum relevance criterion searches for a subset S of m features $\{y_i\}$ whose average mutual information with the target class q is maximal:
$$\max d(S,q),\quad d = \frac{1}{|S|}\sum_{y_i \in S} I(y_i; q),\quad i = 1, 2, \ldots, m \tag{7}$$
Minimum redundancy can be defined as
$$\min r(S),\quad r = \frac{1}{|S|^2}\sum_{y_i, y_j \in S} I(y_i; y_j) \tag{8}$$
The final selection criterion can be articulated as:
$$\max \phi(d, r),\quad \phi = d - r \tag{9}$$
The principle of the mRMR technique is to rank features by balancing redundancy against relevance to acquire the best subset. A model built on a high-dimensional feature subset tends to suffer from overfitting and informational redundancy. Therefore, mRMR (minimum redundancy maximum relevance) with the IFS (incremental feature selection) technique [24,59] and the five-fold cross-validation method were applied to find the optimal feature subset with the maximum accuracy. All features were ranked according to their $\phi$-values, yielding the new feature vector given in Equation (10).
$$I^* = [h_1, h_2, h_3, \ldots, h_n]^T \tag{10}$$
The first feature subset comprises only the feature with the highest $\phi$-value, $I^* = [h_1]^T$. Adding the feature with the second-highest $\phi$-value forms the second subset $I^* = [h_1, h_2]^T$, and adding the third-highest forms the third subset $I^* = [h_1, h_2, h_3]^T$ [47]. The process repeats until all features have been considered.
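The IFS loop described above can be sketched as follows, assuming the features have already been ranked (e.g., by their mRMR $\phi$-values). The helper name, the step size, and accuracy as the selection criterion are assumptions; the evaluation follows the paper's random forest with five-fold CV:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def incremental_feature_selection(X, y, ranked_idx, step=1):
    """Evaluate growing prefixes of a ranked feature list with five-fold CV."""
    X = np.asarray(X)
    best_acc, best_n = 0.0, 0
    for n in range(step, len(ranked_idx) + 1, step):
        subset = ranked_idx[:n]                # top-n ranked features
        clf = RandomForestClassifier(n_estimators=40, random_state=1)
        acc = cross_val_score(clf, X[:, subset], y, cv=5,
                              scoring="accuracy").mean()
        if acc > best_acc:
            best_acc, best_n = acc, n
    return ranked_idx[:best_n], best_acc
```

Plotting accuracy against n reproduces an IFS curve like the one in Figure 3; the prefix with maximal accuracy is the optimal feature subset.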
The support vector machine (SVM) performs binary classification in a supervised-learning setting and has been used in many bioinformatics tools [44,45,46]. We used the freely available package LibSVM (version 3.21), which can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/, to train and test the model. The RBF kernel was chosen for its efficiency in non-linear classification, and its cost and gamma parameters were optimized by grid search over $[2^{-5}, 2^{5}]$ for cost and $[2^{-12}, 2^{1}]$ for gamma. The naïve Bayes classifier, which is based on Bayes' theorem, has been widely used in bioinformatics because of its simplicity and good performance [60]. AdaBoost is an ensemble technique that combines various classifiers to enhance accuracy; its main idea is to reweight the training samples at each iteration [61]. These classifiers were implemented in Weka (version 3.8.4) [62]. Random forest is an ensemble learning technique widely applied in bioinformatics [63,64]. Its underlying principle is to combine several weak classifiers, with the outcome determined by voting, which gives the model higher accuracy and better generalization. The proposed model was constructed using the random forest algorithm [22], and the complete procedure is described in [65]. The scikit-learn package (v0.22.1) [66,67] was used to implement the random forest classifier. We first used randomized search CV and then grid search CV to tune the hyperparameters; the best-tuned parameters of the proposed model are given in Table 3.
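The two-stage search mentioned above (a coarse randomized search followed by a fine grid search) can be sketched with scikit-learn. The parameter ranges below are illustrative assumptions, not the authors' exact search space:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def tune_random_forest(X, y):
    """Coarse randomized search, then a finer grid search around the best point."""
    coarse = RandomizedSearchCV(
        RandomForestClassifier(random_state=1),
        param_distributions={
            "n_estimators": [10, 20, 40],
            "max_depth": [10, 30, None],
            "min_samples_split": [2, 4, 8],
        },
        n_iter=10, cv=5, random_state=1,
    )
    coarse.fit(X, y)
    best = coarse.best_params_
    # refine min_samples_split around the coarse optimum
    fine = GridSearchCV(
        RandomForestClassifier(random_state=1),
        param_grid={
            "n_estimators": [best["n_estimators"]],
            "max_depth": [best["max_depth"]],
            "min_samples_split": [max(2, best["min_samples_split"] - 2),
                                  best["min_samples_split"],
                                  best["min_samples_split"] + 2],
        },
        cv=5,
    )
    fine.fit(X, y)
    return fine.best_estimator_
```

The returned estimator exposes the tuned parameters via `get_params()`, analogous to the values reported in Table 3.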
The Matthews correlation coefficient (MCC), accuracy (Acc), sensitivity (Sn) and specificity (Sp) were used in this study to evaluate the overall performance of the model; they are defined in Equation (11).
$$\begin{cases} Sn = \dfrac{TP}{TP+FN} \\[1.5ex] Sp = \dfrac{TN}{TN+FP} \\[1.5ex] Acc = \dfrac{TP+TN}{TP+FP+TN+FN} \\[1.5ex] MCC = \dfrac{TP\times TN - FP\times FN}{\sqrt{(TP+FN)(TN+FN)(TP+FP)(TN+FP)}} \end{cases} \tag{11}$$
where TP represents the correctly identified 4mC sequences in the benchmark data and FN denotes the 4mC sequences falsely classified as non-4mC. Likewise, TN represents the correctly recognized non-4mC sequences and FP denotes the non-4mC sequences falsely classified as 4mC. Additionally, the receiver operating characteristic (ROC) curve was used to illustrate the performance of the model graphically: it assesses the predictive ability of the model over the whole range of output values. The area under the curve (AUC) was calculated to quantify this performance; a perfect classifier gives AUC = 1, and random performance gives AUC = 0.5.
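Equation (11) translates directly into code; a minimal sketch (the function name is illustrative):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Sn, Sp, Acc and MCC from confusion-matrix counts, per Equation (11)."""
    sn = tp / (tp + fn)                        # sensitivity (recall on 4mC)
    sp = tn / (tn + fp)                        # specificity (recall on non-4mC)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fn) * (tn + fn) * (tp + fp) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```

For instance, a model that classifies every sample correctly gives Sn = Sp = Acc = MCC = 1.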
Analyzing the sequence pattern around the modification site is an effective step in predicting and interpreting the genetic meaning of variations [68,69]. In this study, Two Sample Logo [70] (http://www.twosamplelogo.org/cgi-bin/tsl/tsl.cgi) was used to examine the distribution of nucleotides around 4mC sites. Figure 2 shows that the nucleotide distributions of positive and negative sequences differ in the regions flanking the central C. T and C were enriched upstream and downstream, respectively, of the positive sequences, whereas A and G were correspondingly enriched upstream and downstream of the negative samples. Some nucleotides tend to run consecutively along the sequences.

For example, five consecutive C nucleotides (positions 6–10, 13–17 and 35–39) were found in positive sequences, while three consecutive A nucleotides (positions 1–3 and 8–10) and six consecutive A nucleotides (positions 36–41) were observed in negative sequences. Figure 2 also shows a significant difference between 4mC and non-4mC samples (t-test, P-value < 0.05). These results suggest that the position-specific nucleotide distributions are helpful for the accurate classification of 4mC and non-4mC samples.
Based on sequence features, we constructed a model to identify 4mC sites. First, the training data were converted into feature vectors using the feature descriptors (k-mer, CKSNAP, ENAC, and their fusion). Subsequently, the feature vectors of each encoding were evaluated by the random forest classifier with a five-fold CV test, and mRMR with IFS was used to pick out the best feature subset for better prediction accuracy. Figure 3 shows the IFS curve for searching the optimal features. Table 2 records the performances of the three single-encoding models and the feature fusion model. The AUCs of the single-encoding models (k-mer, CKSNAP, and ENAC) were 0.88, 0.80, and 0.79, respectively, so the AUC of k-mer was notably higher than those of the other encodings. The fusion feature-based model produced the best results: in this optimal model, the Acc, MCC, Sn, Sp, and AUC were 79.91%, 0.598, 81.88%, 78.12% and 0.908, respectively. Figure 4 shows the AUCs of the random forest-based fusion model on the training and independent datasets under five-fold cross-validation. The best parameters are shown in Table 3.
| Method | k | FS | Dimension | Acc (%) | MCC | Sn (%) | Sp (%) | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CKSNAP | 2 | No | 48 | 72.28 | 0.448 | 71.09 | 70.00 | 0.787 |
| CKSNAP | 2 | Yes | 7 | 72.54 | 0.450 | 72.00 | 71.00 | 0.800 |
| ENAC | 5 | No | 148 | 70.02 | 0.418 | 75.00 | 68.82 | 0.776 |
| ENAC | 5 | Yes | 13 | 70.98 | 0.425 | 77.00 | 67.00 | 0.790 |
| k-mer | 6 | No | 5460 | 76.92 | 0.557 | 77.20 | 78.34 | 0.873 |
| k-mer | 6 | Yes | 4088 | 75.66 | 0.539 | 76.80 | 77.34 | 0.863 |
| k-mer | 6 | Yes | 2426 | 77.32 | 0.563 | 79.20 | 77.64 | 0.878 |
| k-mer | 6 | Yes | 1221 | 78.12 | 0.568 | 80.20 | 78.14 | 0.883 |
| k-mer | 6 | Yes | 100 | 78.57 | 0.571 | 80.77 | 77.18 | 0.887 |
| Fusion model | – | No | 5656 | 77.95 | 0.567 | 80.20 | 78.10 | 0.881 |
| Fusion model | – | Yes | 4020 | 77.80 | 0.561 | 78.45 | 79.20 | 0.881 |
| Fusion model | – | Yes | 3105 | 78.30 | 0.581 | 80.25 | 79.10 | 0.893 |
| Fusion model | – | Yes | 2088 | 77.90 | 0.578 | 78.55 | 78.04 | 0.886 |
| Fusion model | – | Yes | 1023 | 79.54 | 0.596 | 81.32 | 78.40 | 0.903 |
| Fusion model | – | Yes | 120 | 79.91 | 0.598 | 81.69 | 78.12 | 0.908 |
| Best parameter | Value |
| --- | --- |
| bootstrap | True |
| max_depth | 30 |
| max_features | 2 |
| min_samples_leaf | 1 |
| min_samples_split | 8 |
| n_estimators | 40 |
The k-mer, CKSNAP, ENAC and fusion features were also fed into three other machine learning classifiers, namely AdaBoost, SVM, and naïve Bayes, for comparison with the random forest-based models [71]. Cross-validation is a statistical analysis method that has been widely used in machine learning to train and test models. A five-fold CV test was used to evaluate each encoding with each classifier: the benchmark dataset was randomly separated into five groups of roughly equal size, each group was independently tested by a model trained on the remaining four groups, the procedure was thus performed five times, and the average of the results was taken as the final result. An optimal model was thereby obtained for each classifier. The results are shown in Table 4. The fused features produced the highest accuracy for every classifier except AdaBoost (69.57%); this comparison between the feature fusion-based and single-encoding-based models indicates that combining multiple kinds of information is effective for achieving better results. As shown in Figure 5, based on the fused features, the random forest model exhibits higher accuracy than the other three machine learning models. In particular, the AUC of the feature fusion model based on the random forest classifier was 1%–10% higher than those of the other models, indicating that random forest was the best choice for 4mC identification.
| Classifier | Method | Acc (%) | MCC | Sn (%) | Sp (%) | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| RF | CKSNAP | 72.54 | 0.450 | 72.00 | 71.00 | 0.800 |
| RF | ENAC | 70.98 | 0.425 | 77.00 | 67.00 | 0.790 |
| RF | k-mer | 78.57 | 0.571 | 80.00 | 77.00 | 0.880 |
| RF | Fusion | 79.91 | 0.598 | 81.88 | 78.12 | 0.908 |
| AB | CKSNAP | 69.03 | 0.381 | 69.00 | 69.00 | 0.746 |
| AB | ENAC | 67.02 | 0.342 | 72.40 | 65.40 | 0.736 |
| AB | k-mer | 70.30 | 0.406 | 70.60 | 70.20 | 0.772 |
| AB | Fusion | 69.57 | 0.391 | 69.20 | 69.70 | 0.766 |
| SVM | CKSNAP | 66.75 | 0.335 | 65.50 | 67.20 | 0.668 |
| SVM | ENAC | 49.93 | −0.01 | 59.90 | 49.90 | 0.499 |
| SVM | k-mer | 76.74 | 0.536 | 73.60 | 78.50 | 0.767 |
| SVM | Fusion | 77.56 | 0.571 | 77.25 | 77.10 | 0.862 |
| NB | CKSNAP | 67.09 | 0.342 | 65.70 | 67.60 | 0.744 |
| NB | ENAC | 68.83 | 0.377 | 68.60 | 68.90 | 0.755 |
| NB | k-mer | 77.61 | 0.554 | 81.80 | 75.50 | 0.854 |
| NB | Fusion | 78.75 | 0.576 | 81.60 | 77.20 | 0.863 |
An independent dataset test was used to compare the proposed model with already published models. Two existing models, i4mC-Mouse and 4mCpred-EL, provide 4mC identification in mice; therefore, the proposed model was assessed against them on the same independent dataset (160 4mC and 160 non-4mC sequences), as shown in Table 5. The MCC, Sn, Sp, Acc, and AUC of i4mC-Mouse were 0.633, 80.71%, 82.52%, 81.61%, and 0.920, respectively, and those of 4mCpred-EL were 0.584, 75.72%, 82.51%, 79.10% and 0.881. The feature fusion model achieved 0.711, 82.00%, 89.13%, 85.41%, and 0.944, respectively. As shown in Figure 6, our proposed model outperformed the two existing models by 2.4% and 6.3% in AUC. The good performance of the proposed model is attributable to the use of different and accurate encoding schemes and the selection of suitable classifiers.
| Method | Acc (%) | MCC | Sn (%) | Sp (%) | References | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| 4mCpred-EL | 79.10 | 0.584 | 75.72 | 82.51 | [19] | 0.881 |
| i4mC-Mouse | 81.61 | 0.633 | 80.71 | 82.52 | [20] | 0.920 |
| model_4mc | 85.41 | 0.711 | 82.00 | 89.13 | Our Work | 0.944 |
4mC is a DNA modification involved in a series of significant genetic processes, such as the regulation of gene expression and cell differentiation. The identification of 4mC sites across the whole genome is vital for understanding their genetic roles. To date, numerous predictors have been established to classify 4mC sites in diverse species [14,17,18,72,73,74], but only two methods, 4mCpred-EL [19] and i4mC-Mouse [20], exist for mice. In this study, an ensemble model was established to identify 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded using k-mer, CKSNAP and ENAC, and these encoding features were then optimized using mRMR with IFS. On the basis of the top feature subset, the best 4mC classification model was obtained with the random forest classifier under a five-fold CV test. The results on independent data showed that the proposed model provides outstanding generalization capability. Further studies will aim to create a user-friendly web server for the proposed model, and additional feature selection methods and algorithms will be explored to further improve the classification of 4mC sites.
This work has been supported by the China Postdoctoral Science Foundation (2020M673188).
The authors declare that there is no conflict of interest.
[1] |
Anderson K, Brooks C, Katsaris A (2010) Speculative bubbles in the S&P 500: was the tech bubble confined to the tech sector? J Empir Financ 17: 345–361. doi: 10.1016/j.jempfin.2009.12.004
![]() |
[2] |
Apergis N, Miller SM (2009) Do structural oil-market shocks affects stock return. Energy Econ 31: 569–575. doi: 10.1016/j.eneco.2009.03.001
![]() |
[3] | Arouri MEH, Rault C (2011) On the influence of oil prices on stock markets: evidence from panel analysis in GCC countries. Int J Financ Econ 3: 242–253. |
[4] |
Arouri MEH, Jouini J, Nguyen DK (2012) On the impacts of oil price fluctuations on European equity markets: volatility spillover and hedging effectiveness. Energy Econ 34: 611–617. doi: 10.1016/j.eneco.2011.08.009
![]() |
[5] |
Baker M, Wurgler J (2007) Investor sentiment in the stock market. J Econ Perspectives 21: 129–152. doi: 10.1257/jep.21.2.129
![]() |
[6] |
Beckman J, Belke A, Kühl M (2011) Global integration of central and eastern european financial markets-the role of economic sentiments. Rev Int Econ 19: 137–157. doi: 10.1111/j.1467-9396.2010.00937.x
![]() |
[7] | Bildirici M (2012a) Economic growth and energy consumption in G7 countries: ms-var and ms-granger causality analysis. The J of Energy and Development 38: 1–30. |
[8] | Bildirici M (2012b) The relationship between economic growth and electricity consumption in africa: ms-var and ms-granger causality analysis. The J of Energy and Development 37: 179–207. |
[9] | Bildirici M (2013) Economic growth and electricity consumption: MS-VAR and MS-Granger causality analysis. OPEC Energy Rev 38: 447–476. |
[10] |
Brown GW, Cliff MT (2005) Investor sentiment and asset valuation. J of Business 78: 405–440. doi: 10.1086/427633
![]() |
[11] |
Campbell JY (1991) A variance decomposition for stock returns. Econ J 101: 157–179. doi: 10.2307/2233809
![]() |
[12] |
Chen NF, Roll RS, Ross A (1986) Economic Forces and the Stock Market. J of Business 59: 383–403. doi: 10.1086/296344
![]() |
[13] |
Chen SS (2010) Do higher oil prices push the stock market into bear territory? Energy Econ 32: 490–495. doi: 10.1016/j.eneco.2009.08.018
![]() |
[14] |
Ciner C (2001) Energy shocks and financial markets: nonlinear linkages. Studies in Non-Linear Dynamics and Econ 5: 203–212. doi: 10.1162/10811820160080095
![]() |
[15] | Chang TP, Hu JL (2009) Incorporating a leading indicator into the trading rule through the markov-switching vector autoregression model. Appl Econ 16: 1255–1259. |
[16] |
Cong RG, Wei YM, Jiao JL et al. (2008) Relationships between oil price shocks and stock market: An empirical analysis from China. Energy Policy 36: 3544–3553. doi: 10.1016/j.enpol.2008.06.006
![]() |
[17] | Degiannakis S, Filis G, Kizys R (2014) The effects of oil price shocks on stock market volatility: evidence from European data. Energy J 35: 35–56. |
[18] | DG Sanco Final Report by Europe Economics (2007) An analysis of the issue of consumer detriment and the most appropriate methodologies to estimate it. 1–574. Available from: http://www.europe-economics.com/publications/study_consumer_detriment.pdf . |
[19] |
Ding Z, Liu Z, Zhang Y, et al. (2017) The contagion effect of international crude oil price fluctuations on chinese stock market investor sentiment. Appl Energy 187: 27–36. doi: 10.1016/j.apenergy.2016.11.037
![]() |
[20] | Electronic Data Delivery System. Available from: https://evds2.tcmb.gov.tr/. |
[21] | Energy Information Administration. Available from: https://www.eia.gov/. |
[22] |
Ewing B, Thmpson M (2007) Dynamic cyclical comovements of oil prices with industrial production, consumer prices, unemployment, and stock prices. Energy Policy 35: 5535–5540. doi: 10.1016/j.enpol.2007.05.018
![]() |
[23] | European Central Bank Monthly Bulletin. 45–58. Available from: https://www.ecb.europa.eu/pub/pdf/mobu/mb201301en.pdf. |
[24] |
Faff RW, Brailsford TJ (1999) Oil price risk and the Australian stock market. J of Energy and Financ Development 4: 69–87. doi: 10.1016/S1085-7443(99)00005-8
![]() |
[25] | Fallahi F, Rodriguez G (2007) Using markov-switching models to identify the link between unemployment and criminality. Available from: https://sciencessociales.uottawa.ca/economics/sites/socialsciences.uottawa.ca.economics/files/0701E.pdf. |
[26] |
Falahi F (2011) Causal relationship between energy consumption (ec) and gdp: a markov-switching (ms) causality. Energy 36: 4165–4170. doi: 10.1016/j.energy.2011.04.027
![]() |
[27] | Fisher I (1906) The Nature of Capital and Income, New York, Macmillan. |
[28] |
Hamilton JD (1988) Rational-expectations econometric analysis of changes in regime: An investigation of the term structure of interest rates. J of Econ Dynamics and Control 12: 385–423. doi: 10.1016/0165-1889(88)90047-4
![]() |
[29] |
Hamilton JD (1989) A new approach to the economic analysis of nonstationary time series and the business cycle. Econ 57: 357–384. doi: 10.2307/1912559
![]() |
[30] |
Hamilton JD (1990) Analysis of time series subject to changes in regime. J of Econ 45: 39–70. doi: 10.1016/0304-4076(90)90093-9
![]() |
[31] | Hicks JR (1934a) Application of mathematical methods to the theory of risk. Econ 4: 194–5. |
[32] | Hicks JR (1934b) A note on the elasticity of supply. Rev of Econ Studies 10: 31–7. |
[33] |
Huang RD, Masulis RW, Stoll HR (1996) Energy shocks and financial markets. J Futures Mark 16: 1–27. doi: 10.1002/(SICI)1096-9934(199602)16:1<1::AID-FUT1>3.0.CO;2-Q
![]() |
[34] |
Jones CM, Kaul G (1996) Oil and the stock markets. J Financ 51: 463–491. doi: 10.1111/j.1540-6261.1996.tb02691.x
![]() |
[35] |
Kang W, Ratti RA, Yoon KH (2015) The impact of oil price shocks on the stock market return and volatility relationship, Journal of International Financial Markets. Institutions and Money 34: 41–54. doi: 10.1016/j.intfin.2014.11.002
![]() |
[36] |
Kaul G, Seyhun HN (1990) Relative price variability, real shocks, and the stock market. The j of financ 45: 479–496. doi: 10.1111/j.1540-6261.1990.tb03699.x
![]() |
| Attribute | Training Data | Independent Data | Total |
|---|---|---|---|
| Positive | 746 | 160 | 906 |
| Negative | 746 | 160 | 906 |
| Total | 1492 | 320 | 1812 |
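One of the encodings evaluated below, CKSNAP (composition of k-spaced nucleotide pairs), is reported with 48 dimensions, which is consistent with 16 ordered base pairs counted at each gap size 0–2. The following is a minimal sketch under that reading, not necessarily the authors' exact implementation:

```python
from itertools import product

def cksnap_features(seq, kmax=2):
    """Composition of k-spaced nucleotide pairs for gaps 0..kmax.

    Each gap value contributes 16 features (all ordered pairs over
    ACGT), so kmax = 2 yields 48 features, matching the CKSNAP
    dimension reported before feature selection.
    """
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    feats = []
    for gap in range(kmax + 1):
        counts = {p: 0 for p in pairs}
        windows = len(seq) - gap - 1
        for i in range(windows):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:        # ignore ambiguous bases such as N
                counts[pair] += 1
        feats.extend(counts[p] / windows for p in pairs)
    return feats

vec = cksnap_features("ACGTACGTACGT")
print(len(vec))  # 48
```

Each 16-entry block is normalized by its own window count, so the blocks are directly comparable across gap sizes.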
| Method | k | FS | Dimension | Acc (%) | MCC | Sn (%) | Sp (%) | AUC |
|---|---|---|---|---|---|---|---|---|
| CKSNAP | 2 | No | 48 | 72.28 | 0.448 | 71.09 | 70.00 | 0.787 |
| CKSNAP | 2 | Yes | 7 | 72.54 | 0.450 | 72.00 | 71.00 | 0.800 |
| ENAC | 5 | No | 148 | 70.02 | 0.418 | 75.00 | 68.82 | 0.776 |
| ENAC | 5 | Yes | 13 | 70.98 | 0.425 | 77.00 | 67.00 | 0.790 |
| k-mer | 6 | No | 5460 | 76.92 | 0.557 | 77.20 | 78.34 | 0.873 |
| k-mer | 6 | Yes | 4088 | 75.66 | 0.539 | 76.80 | 77.34 | 0.863 |
| k-mer | 6 | Yes | 2426 | 77.32 | 0.563 | 79.20 | 77.64 | 0.878 |
| k-mer | 6 | Yes | 1221 | 78.12 | 0.568 | 80.20 | 78.14 | 0.883 |
| k-mer | 6 | Yes | 100 | 78.57 | 0.571 | 80.77 | 77.18 | 0.887 |
| Fusion model | – | No | 5656 | 77.95 | 0.567 | 80.20 | 78.10 | 0.881 |
| Fusion model | – | Yes | 4020 | 77.80 | 0.561 | 78.45 | 79.20 | 0.881 |
| Fusion model | – | Yes | 3105 | 78.30 | 0.581 | 80.25 | 79.10 | 0.893 |
| Fusion model | – | Yes | 2088 | 77.90 | 0.578 | 78.55 | 78.04 | 0.886 |
| Fusion model | – | Yes | 1023 | 79.54 | 0.596 | 81.32 | 78.40 | 0.903 |
| Fusion model | – | Yes | 120 | 79.91 | 0.598 | 81.69 | 78.12 | 0.908 |
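The 5460-dimensional k-mer feature above equals 4 + 16 + 64 + 256 + 1024 + 4096, i.e. concatenated normalized frequency vectors for every k from 1 to 6. A minimal sketch under that assumption (the authors' exact encoding may differ in detail):

```python
from itertools import product

def kmer_features(seq, kmax=6):
    """Concatenated normalized k-mer frequency vectors for k = 1..kmax.

    For kmax = 6 this yields 4 + 16 + ... + 4096 = 5460 features,
    matching the pre-selection dimension reported for the k-mer method.
    """
    feats = []
    for k in range(1, kmax + 1):
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        index = {km: i for i, km in enumerate(kmers)}
        counts = [0.0] * len(kmers)
        windows = len(seq) - k + 1
        for i in range(windows):
            km = seq[i:i + k]
            if km in index:        # skip windows containing ambiguous bases
                counts[index[km]] += 1
        feats.extend(c / windows for c in counts)
    return feats

vec = kmer_features("ACGTACGTACGTACGTACGT", kmax=6)
print(len(vec))  # 5460
```

Each per-k block sums to 1, so the vector is a stack of six probability distributions over progressively longer words.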
| Parameter | Best value |
|---|---|
| Bootstrap | True |
| Max-depth | 30 |
| Max-features | 2 |
| Min-samples-leaf | 1 |
| Min-samples-split | 8 |
| n-estimators | 40 |
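The tuned settings above read like arguments to scikit-learn's `RandomForestClassifier` (an assumption; the table does not name the library). A sketch on synthetic placeholder data, with `X`/`y` standing in for the encoded sequences:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 100 samples, 10 features, label driven by feature 0.
rng = np.random.default_rng(0)
X = rng.random((100, 10))
y = (X[:, 0] > 0.5).astype(int)

# Best-parameter values from the table, mapped onto scikit-learn names
# (this mapping is an assumption, not stated in the source).
clf = RandomForestClassifier(
    bootstrap=True,
    max_depth=30,
    max_features=2,
    min_samples_leaf=1,
    min_samples_split=8,
    n_estimators=40,
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```

In practice such values usually come out of a grid search over a cross-validated training set rather than being chosen by hand.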
| Classifier | Method | Acc (%) | MCC | Sn (%) | Sp (%) | AUC |
|---|---|---|---|---|---|---|
| RF | CKSNAP | 72.54 | 0.450 | 72.00 | 71.00 | 0.800 |
| RF | ENAC | 70.98 | 0.425 | 77.00 | 67.00 | 0.790 |
| RF | k-mer | 78.57 | 0.571 | 80.00 | 77.00 | 0.880 |
| RF | Fusion | 79.91 | 0.598 | 81.88 | 78.12 | 0.908 |
| AB | CKSNAP | 69.03 | 0.381 | 69.00 | 69.00 | 0.746 |
| AB | ENAC | 67.02 | 0.342 | 72.40 | 65.40 | 0.736 |
| AB | k-mer | 70.30 | 0.406 | 70.60 | 70.20 | 0.772 |
| AB | Fusion | 69.57 | 0.391 | 69.20 | 69.70 | 0.766 |
| SVM | CKSNAP | 66.75 | 0.335 | 65.50 | 67.20 | 0.668 |
| SVM | ENAC | 49.93 | -0.01 | 59.90 | 49.90 | 0.499 |
| SVM | k-mer | 76.74 | 0.536 | 73.60 | 78.50 | 0.767 |
| SVM | Fusion | 77.56 | 0.571 | 77.25 | 77.10 | 0.862 |
| NB | CKSNAP | 67.09 | 0.342 | 65.70 | 67.60 | 0.744 |
| NB | ENAC | 68.83 | 0.377 | 68.60 | 68.90 | 0.755 |
| NB | k-mer | 77.61 | 0.554 | 81.80 | 75.50 | 0.854 |
| NB | Fusion | 78.75 | 0.576 | 81.60 | 77.20 | 0.863 |
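Acc, MCC, Sn and Sp in these tables follow their standard confusion-matrix definitions; a small helper makes the relationship explicit (the example counts below are illustrative, not taken from the paper):

```python
import math

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix
    counts, the four figures reported in the comparison tables."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Hypothetical counts for a balanced 320-sequence test set (160/160).
sn, sp, acc, mcc = metrics(tp=131, fp=18, tn=142, fn=29)
print(f"Sn={sn:.4f} Sp={sp:.4f} Acc={acc:.4f} MCC={mcc:.3f}")
```

On a balanced set Acc is the mean of Sn and Sp, while MCC penalizes any imbalance between false positives and false negatives, which is why it separates the classifiers more sharply than accuracy alone.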
| Method | Acc (%) | MCC | Sn (%) | Sp (%) | References | AUC |
|---|---|---|---|---|---|---|
| 4mCpred-EL | 79.10 | 0.584 | 75.72 | 82.51 | [19] | 0.881 |
| i4mC-Mouse | 81.61 | 0.633 | 80.71 | 82.52 | [20] | 0.920 |
| model_4mc | 85.41 | 0.711 | 82.00 | 89.13 | Our Work | 0.944 |