The present article seeks to analyze the financial policies of companies backed by Private Equity and Venture Capital funds (PE/VC). Our sample consists of firms completing an initial public offering between January 1991 and December 2000. Our hypotheses relate to the difference between VC- and non-VC-backed firms in terms of financial policies and their persistence. We use four measures to evaluate the firms' financial policies: i) cash holdings; ii) leverage; iii) dividends out of earnings; and iv) interest coverage. To test the four hypotheses, we run pooled OLS regressions. The results suggest that VC-backed firms keep a higher level of cash holdings than non-VC-backed firms, an effect that lasts for at least 8 years after the IPO. We show that VC-backed firms are associated with a lower level of leverage over the first 8 years after the IPO. In contrast, while interest coverage is lower in the first years after the IPO, this result is not persistent and even reverses in later years. Finally, we do not find statistically significant evidence of a difference between VC- and non-VC-backed firms in the dividend-to-earnings ratio. Our results are robust across statistical methods and different methodologies.
Citation: Antonio Gledson de Carvalho, Pinheiro Roberto, Sampaio Joelson. Venture capital backing: financial policies and persistence over time[J]. Quantitative Finance and Economics, 2021, 5(4): 640-663. doi: 10.3934/QFE.2021029
1. Introduction
According to the World Health Organization, cardiovascular disease has become the leading cause of death worldwide. Atherosclerosis, a slow narrowing of the arteries that impedes blood flow from the heart to the brain, underlies most cardiovascular diseases [1,2,3]. Its subclinical latency period is 30 to 50 years, with a long asymptomatic phase [4]. Many factors influence the onset of atherosclerosis, and these factors are highly correlated and interact with each other. Doctors therefore often diagnose patients on the basis of a few typical features, which makes accurate diagnosis difficult at an early stage, when those features have not yet changed appreciably. Traditional statistical models are not effective enough in this situation. To identify atherosclerosis earlier, machine learning methods are widely used to build models that predict and prevent the disease. Algorithms such as support vector machines, decision trees, naive Bayes and several others have improved disease prediction and helped extend human life worldwide [5,6,7,8]. For instance, Couturier et al. used a three-step method based on cluster-supervised classification and frequent itemset search to predict whether a patient was likely to develop atherosclerosis from the relevance of lifestyle habits and social environment [9]. Artificial neural networks (ANN) and random forests (RF) are also widely used in atherosclerosis research: they are applicable to most datasets, perform well, are easy to use, and reduce the computational burden [10].
Many researchers have contributed to the prediction of cardiovascular disease using different machine learning methods. We review the relevant literature and summarize the research methods and results in Table 1.
(Excerpt from Table 1: reported AUCs were 79.6% on the Hungarian dataset, 79.0% on the Cleveland dataset, 91.2% on the Z-Alizadeh Sani dataset, and 79.6% on the Statlog dataset.)
Over the past two decades, many doctors and researchers have tried to combine machine learning methods and imaging biomarkers to predict atherosclerosis. Lin [20] combined discriminative feature selection with semi-supervised graph-based regression to detect plaque changes. However, atherosclerosis is affected by many factors, and analyses built on imaging biomarkers alone are often of limited use in the early period. Terrada et al. [7] used different machine learning models such as ANN and KNN and achieved high accuracy (the best accuracy was 96%), but they did not perform additional feature selection on features that include a variety of demographic, physical and chemical, imaging and echo indexes. These features are medically important, but they are highly correlated and interact with each other. Rao et al. [11] proposed the N2Genetic optimizer to improve performance, reaching a prediction accuracy of 99.73%. Hathaway et al. [21] combined deep learning with routine atherosclerosis prediction using simple office-based clinical features. However, these studies focus on computational optimization over different features without considering the strong correlations among the features they use. Such correlations increase the information redundancy of the model, and it is necessary to reduce this redundancy through feature selection. According to Jamthikar et al. [22], supervised ML-based algorithms consist of five components: (i) data partitioning, (ii) feature engineering, (iii) model training, (iv) prediction and (v) performance evaluation. Skandha et al. [23] made a major breakthrough in prediction accuracy by applying deep learning to the training and prediction stages. These works inspired us to focus on feature engineering and to integrate it further with operations research.
This paper presents an operations research-based machine learning approach to atherosclerosis prediction. The focus is on combining statistical analysis and machine learning to reduce information redundancy and further improve the accuracy of disease diagnosis. First, among the 34 features we remove those with many missing values and impute those with few, leaving 25 features. Then t-tests and chi-square tests are applied to the continuous and discrete features, respectively, and features that fail the statistical test are removed, leaving 20 features. Next, machine learning combined with an optimal correlation distance is used to select 15 significant features, and the prediction model is evaluated on these 15 features. Finally, the information in the 5 non-significant features that were screened out is exploited through ensemble learning to further improve the prediction of atherosclerosis. Experiments show that reducing information redundancy significantly improves prediction performance.
The rest of this paper is organized as follows. Section 2 describes related work, including data sources and processing, the choice of prediction performance metrics and prediction models, and the use of random forests to rank features. Section 3 presents the feature selection inspired by operations research and proposes an optimal distance model based on the Dijkstra algorithm. Experimental results are presented in Section 4, and Section 5 concludes.
2. Methods
2.1. Data sources and preprocessing
The data in this paper come from a study conducted from January 2016 to December 2017 at the Affiliated Hospital of Nanjing University of Chinese Medicine; the study was approved by the hospital's ethics committee, and written informed consent was obtained from all patients. After screening, we select 34 characteristics. To better study the pathology of atherosclerosis, we divide the total sample under the guidance of relevant experts, with 49% categorized as the atherosclerosis risk group and 51% as a control group without atherosclerosis risk.
In consultation with medical professionals and based on relevant tests, we refine the group classification to capture patients at risk of hypertension, cardiovascular disease, chronic kidney disease and hyperlipidemia (HL), who constitute the risk group for this study. Patients with hyperlipidemia are defined by a low-density lipoprotein (LDL) level ≥ 3.36 mmol/L, and/or a total cholesterol (TC) level ≥ 5.17 mmol/L, and/or a triglyceride (TG) level ≥ 1.69 mmol/L. Healthy controls are mainly selected from physical examinations during the same period; people with abnormal hemoglobin or a history of previous cardiovascular events are excluded, as are people with cancer, diabetes or autoimmune diseases.
Data analysis shows that the dataset contains 8 discrete and 26 continuous features. Since most features have missing values, the missing pattern of each feature is analyzed before any imputation. Preliminary statistics show that 2 features have no missing values, while the remaining 30 features have varying degrees of missingness. In general, features with more than 30% missing values are removed, because imputing every feature would introduce problems such as bias toward certain extreme pathologies and large imputation errors. Features with many missing values are therefore selectively eliminated and 25 features are retained, as shown in Table 2. Considering that the missing proportion of the remaining features is low and weighing the difficulty of each imputation method, this paper uses simple statistical imputation: continuous variables are filled with the median if their distribution is skewed and with the mean otherwise, and discrete variables are filled with the mode. In the end, we obtain 622 samples with complete features. Under the guidance of relevant experts, 304 samples are labeled as the atherosclerosis risk group and 318 as the control group. In addition, the data are randomly split into a 70% training set and a 30% test set in preparation for cross-validation.
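The imputation rule above can be sketched as follows. This is a minimal illustration that assumes the data sit in a pandas DataFrame and that the column lists (`continuous_cols`, `discrete_cols`) and the skewness threshold are supplied by the user; the paper does not specify these details.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, continuous_cols, discrete_cols, skew_threshold=1.0):
    """Fill missing values: median for skewed continuous features,
    mean for roughly symmetric ones, and mode for discrete features."""
    df = df.copy()
    for col in continuous_cols:
        if abs(df[col].skew()) > skew_threshold:      # skewed -> median
            df[col] = df[col].fillna(df[col].median())
        else:                                         # roughly symmetric -> mean
            df[col] = df[col].fillna(df[col].mean())
    for col in discrete_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])   # most frequent value
    return df
```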
Table 2. Twenty-five features affecting atherosclerosis and their statistical values.
2.2. Statistical analysis
When the data are processed statistically, continuous characteristics are expressed as 'mean ± standard deviation', and an independent-sample t-test is used to compare the two groups (with and without atherosclerosis risk); discrete characteristics are expressed as counts and percentages, and group differences are compared with the chi-square test, with p < 0.05 as the significance threshold. The basic situation of the full sample after this analysis is shown in Table 2.
As seen in Table 2, using label encoding we code the gender variable as 1 for male and 0 for female. Age, BMI (body mass index), triglycerides, total cholesterol, glucose, hemoglobin, white blood cell count, red blood cell count, HDL (high-density lipoprotein), LDL (low-density lipoprotein), systolic blood pressure, diastolic blood pressure, LCCA-IMT (left common carotid artery intima-media thickness), RCCA-IMT (right common carotid artery intima-media thickness), LCCA-RI (resistance index of blood flow in the left common carotid artery), RCCA-RI (resistance index of blood flow in the right common carotid artery), LCCA-BS (pulse wave conduction velocity at the beginning of systole in the left common carotid artery), LCCA-ES (pulse wave conduction velocity at the end of systole in the left common carotid artery), RCCA-BS (pulse wave conduction velocity at the beginning of systole in the right common carotid artery) and RCCA-ES (pulse wave conduction velocity at the end of systole in the right common carotid artery) passed the statistical test. We thus retain 20 features; the remaining features are screened out because they do not meet the statistical requirements.
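As a rough sketch, this screening step can be run with scipy as below. The DataFrame `df`, the binary label column `risk` and the column lists are hypothetical placeholders, and the significance level follows the p < 0.05 criterion above.

```python
import pandas as pd
from scipy import stats

def screen_features(df, continuous_cols, discrete_cols, label="risk", alpha=0.05):
    """Keep features whose between-group difference is significant at p < alpha."""
    kept = []
    g1, g0 = df[df[label] == 1], df[df[label] == 0]
    for col in continuous_cols:                       # independent two-sample t-test
        _, p = stats.ttest_ind(g1[col], g0[col])
        if p < alpha:
            kept.append(col)
    for col in discrete_cols:                         # chi-square test on the contingency table
        table = pd.crosstab(df[col], df[label])
        _, p, _, _ = stats.chi2_contingency(table)
        if p < alpha:
            kept.append(col)
    return kept
```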
2.3. Predictive performance metrics
We assess our model with a range of relevant metrics. The ROC (receiver operating characteristic) curve visualizes, as an image, the relationship between the sensitivity and specificity of the model as the classification threshold varies. The AUC is the area under the ROC curve; it is a metric commonly used to evaluate binary classification models, with higher values indicating better prediction.
However, the AUC only evaluates the overall training effect of the model and does not indicate how to split the classes to obtain the best prediction. In our experiments, we therefore also use the KS (Kolmogorov-Smirnov) statistic [24] to evaluate the classification effectiveness of the model. For binary classification, the KS value, like the AUC, is built from the TPR and FPR; it is the maximum of the difference between TPR and FPR, and the point where this maximum occurs indicates the optimal classification threshold, as shown in Figure 1. Moreover, we use the sensitivity-specificity curve and the precision-recall curve to calibrate the model.
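A small helper illustrating these two metrics, assuming probability scores from any classifier (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_and_ks(y_true, y_score):
    """AUC plus the KS statistic: the maximum gap between TPR and FPR,
    whose location also gives the best classification threshold."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ks_values = tpr - fpr
    best = np.argmax(ks_values)
    return roc_auc_score(y_true, y_score), ks_values[best], thresholds[best]

# Example usage: auc, ks, cut = auc_and_ks(y_test, model.predict_proba(X_test)[:, 1])
```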
2.4. Predictive model based on RFC
Random forest (RF) [25] is an optimized version of the bagging algorithm and has been used in many areas, including banking, stock markets, pharmaceuticals and e-commerce. In healthcare, it can identify correctly combined drug ingredients and accurately recognize diseases from a patient's medical history. We therefore build our predictive model of atherosclerosis risk on an RF classifier.
2.5. Random forest classifier
We start by splitting the data into a 70% training set and a 30% test set, ensuring that all samples are equally likely to be selected for the training set. If the predictions obtained after several such equally likely samplings are stable, the current model is feasible. Once the dataset has been partitioned, the random forest is used to train the model.
A single decision tree rarely achieves high accuracy, mainly because finding an optimal decision tree (one with minimum generalization error) is NP-hard (all possible tree structures cannot be enumerated), so learning usually ends at a locally optimal solution. Models built from a single tree are also not stable: small changes in the sample can easily change the tree structure.
We use the idea of the bagging algorithm to randomly divide the training set into several training subsets and build a decision tree on each subset. When building each tree, the feature-subspace idea is introduced: at each node, the candidate features for the best split are not all features available at that node but a randomly selected subset of them. The final prediction is obtained by voting over the decision trees.
We draw T sample sets, each containing m training samples, by bootstrap sampling, train a base learner on each sample set, and then combine them. When combining the predicted outputs, bagging-style methods usually use simple voting for classification and simple averaging for regression. If the vote is tied, the simplest option is to select a class at random, or the confidence of the learners' votes can be examined to determine the final class. For a given dataset of n training samples D = {x_1, x_2, ..., x_n}, the exact procedure is as follows; a minimal code sketch of these steps is given after the list.
1) A new training set of n samples is formed by bootstrap sampling with replacement;
2) Step 1) is repeated T times to obtain T training sets;
3) T base classifiers are trained independently, one on each training set, using the classification algorithm;
4) For each test sample, T predictions are obtained from the T classifiers;
5) For each test sample, a majority vote gives the final prediction.
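A minimal sketch of steps 1) through 5), assuming numpy arrays and binary 0/1 labels; the tree settings (e.g., a random "sqrt" feature subset per split) are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, T=100, seed=0):
    """Draw T bootstrap samples, train one tree per sample with a random
    feature subset at each split, then majority-vote (binary 0/1 labels)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = np.zeros((T, len(X_test)))
    for t in range(T):
        idx = rng.integers(0, n, size=n)              # bootstrap: sampling with replacement
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
        tree.fit(X_train[idx], y_train[idx])
        votes[t] = tree.predict(X_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)    # majority vote over the T trees
```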
2.6. Measurement of feature importance
The Gini index measures the probability of misclassifying a randomly selected sample from a sample set; a smaller Gini index indicates a lower probability of misclassification, i.e., higher sample purity (when all samples in the set belong to one class, the Gini index is 0). The Gini index is calculated as follows.
$$ \mathrm{Gini}(p) = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k=1}^{K} p_k (1 - p_k) \tag{2} $$
where p_k denotes the probability that the selected sample belongs to the k-th class. The node with the smallest Gini index is the least likely to misclassify, so it is used as the root node of the decision tree. In the resulting forest, the Gini index reflects the fineness of the model and is negatively correlated with node purity.
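For concreteness, Eq (2) can be computed directly from the class labels reaching a node; this is a small illustrative helper, not part of the paper's code.

```python
import numpy as np

def gini_index(labels):
    """Gini(p) = 1 - sum_k p_k^2 for the class labels reaching a node (Eq (2))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Example: gini_index([1, 1, 0, 0]) == 0.5, while a pure node gives 0.0
```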
Let M be the set of nodes of the j-th decision tree at which feature i appears. The importance of feature i in the j-th decision tree is then:
$$ A^{(\mathrm{Gini})}_{ij} = \sum_{m \in M} \left( \mathrm{Gini}(m) - \mathrm{Gini}(l) - \mathrm{Gini}(r) \right) \tag{3} $$
where Gini(l) and Gini(r) are the Gini indices of the two new nodes produced when node m is split. Finally, the importance of each feature over the T decision trees is calculated as follows.
$$ A_i = \frac{\sum_{j=1}^{T} A^{(\mathrm{Gini})}_{ij}}{\sum_{i=1}^{c} \sum_{j=1}^{T} A^{(\mathrm{Gini})}_{ij}} \tag{4} $$
where A_i is the normalized importance of the i-th feature and c is the number of features. The importance of the selected features is shown in Table 3.
Table 3. Fifteen selected features and importance ranking obtained from random forest.
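In practice these importances can be obtained from a fitted random forest. The sketch below uses scikit-learn's built-in Gini importance, which is normalized much as in Eq (4); the variable names are placeholders for the preprocessed dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(X: pd.DataFrame, y, n_estimators=500, seed=0):
    """Rank features by the forest's normalized Gini importance (cf. Eq (4))."""
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    rf.fit(X, y)
    return pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Usage (hypothetical names): rank_features(df[kept_features], df["risk"]).head(15)  # cf. Table 3
```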
3. Feature selection (FS) inspired by operations research
3.1. Correlation distances and redundancy between features
To improve the identification accuracy of the atherosclerosis risk prediction model, feature quantities that are not relevant to the classification target need to be removed. After screening out relevant features with statistical tests, feature redundancy must be considered. Feature redundancy refers to correlation between features: a redundant feature that is highly correlated with other features also has a significant impact on the model, and two perfectly correlated features are mutually redundant. We model the optimal correlation distance between features by transforming the problem into a combinatorial optimization problem in operations research.
We can select features for the model using a measure of relevance, and similarity between individual features can be measured by the distance between them. Estimating similarity between samples, usually via correlation coefficients, is common in many research problems; it is typically done by computing distances, and the choice of distance affects the correctness of feature classification and feature selection. From the correlation coefficient between two features we derive the correlation distance D_xy used below.
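The exact definitions (Eqs (5) and (6)) are not reproduced above, so the sketch below assumes the common choice of the Pearson correlation coefficient with distance D_xy = 1 - |r_xy|; treat it as an illustration rather than the paper's exact formula.

```python
import pandas as pd

def correlation_distance_matrix(X: pd.DataFrame) -> pd.DataFrame:
    """Pairwise correlation distances between features, assuming D_xy = 1 - |r_xy|
    with r_xy the Pearson correlation coefficient (the paper's Eqs (5)-(6) are not shown here)."""
    return 1.0 - X.corr(method="pearson").abs()
```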
In data structures, topological sorting and algorithms, transforming an operations or optimization problem into a graph-theoretic problem is often helpful. We treat the 20 features as the nodes of a graph and label each edge with a weight, i.e., each edge has a unique corresponding value, so the graph is a weighted graph. The weight of each edge is the correlation distance of Eq (6), and the weighted undirected graph to be solved in this paper can be represented by the triple G = (V, E, W), where W is the correlation function between nodes, i.e., the 20 × 20 matrix formed by the correlation distances D_xy between the nodes.
By filtering features we can extract combinations of nodes that yield better atherosclerosis prediction models. We combine the idea of computerized search with the shortest-distance idea from graph theory to build an optimal distance model solved in the style of the Dijkstra algorithm. When k features are to be selected, we assume the selection starts from some node and obtain the feature set for that node by adding the k-1 features that have the least redundancy with the set already chosen. By letting the computer traverse every possible starting node and comparing the results, we obtain the best filtered feature set for a given k.
We adapt the Dijkstra algorithm [26] for this route-solving problem. We introduce an auxiliary array D in which each element D[i] (1 ≤ i ≤ 20) records the currently known distance from the starting node to the other nodes; initially, if we go from starting node i to node j, then D = D_ij, i.e., the weight of the edge from node i to node j. Dijkstra's algorithm solves shortest paths in graph theory; here, to reduce the impact of redundant features on the model, we instead look for the node at the greatest correlation distance from the current node, which turns the procedure into a longest-path-style search. D[1] is the length of the path from the origin to the node at the greatest distance. Denoting the set of reached nodes by S, the next step finds the farthest node t not in S, giving D[2], and so on, which yields the objective function max ∑_{i=1}^{k} D[i]. The specific process is shown in the flow chart of Figure 4a, and a code sketch of this greedy selection is given after the figure caption.
Figure 4. a. Flow chart of the Dijkstra algorithm. b. Flow chart of feature selection based on the optimal distance model.
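One plausible reading of this greedy, Dijkstra-style procedure is sketched below, reusing the correlation-distance matrix W from the previous sketch; the use of the summed distance to the already-selected set is an assumption, not the paper's verified implementation.

```python
import numpy as np
import pandas as pd

def greedy_feature_subset(W: pd.DataFrame, k: int):
    """From every possible start feature, repeatedly add the not-yet-selected feature
    with the largest total correlation distance to the current set, and keep the start
    whose selected set has the largest summed distance (one reading of the model)."""
    best_set, best_score = None, -np.inf
    features = list(W.columns)
    for start in features:
        selected, score = [start], 0.0
        while len(selected) < k:
            remaining = [f for f in features if f not in selected]
            # distance of each candidate to the already selected features
            dists = {f: W.loc[f, selected].sum() for f in remaining}
            nxt = max(dists, key=dists.get)
            score += dists[nxt]
            selected.append(nxt)
        if score > best_score:
            best_set, best_score = selected, score
    return best_set, best_score
```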
Using this Dijkstra-style procedure, we obtain the longest path, i.e., the optimal feature set starting from node i when k features are selected. The computer then traverses the starting nodes from 1 to 20 to obtain the different optimal feature sets. To strengthen the accuracy of the screening results, we further traverse k from 1 to 20 features, use each candidate set in the random forest-based atherosclerosis prediction model, evaluate it with the KS statistic, and finally obtain the objective function max ∑_{k=1}^{20} KS_k, which gives the final optimal feature set and its KS value. The specific model is as follows.
$$ \max \sum_{k=1}^{20} KS_k \quad \text{s.t.} \quad \begin{cases} G = (V, E, W) \\ D'_k = \max \sum_{i=1}^{k} D[i] \end{cases} \tag{7} $$
At this point we obtain the feature set that makes the random forest-based atherosclerosis prediction model optimal and complete the feature screening; the specific steps are shown in Figure 4b.
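The outer search of Eq (7) can be sketched as follows, reusing the helpers from the earlier sketches (`correlation_distance_matrix`, `greedy_feature_subset`, `auc_and_ks` are names introduced in this article's code sketches, not the paper's code).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def select_best_feature_set(df, candidate_features, label="risk"):
    """For every subset size k, pick the greedy max-distance subset, train a random
    forest, and keep the subset with the largest KS (cf. Eq (7))."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[candidate_features], df[label], test_size=0.3, random_state=0)
    W = correlation_distance_matrix(X_tr)             # from the earlier sketch
    best = (None, -1.0)
    for k in range(1, len(candidate_features) + 1):
        feats, _ = greedy_feature_subset(W, k)        # from the earlier sketch
        rf = RandomForestClassifier(n_estimators=500, random_state=0)
        rf.fit(X_tr[feats], y_tr)
        _, ks, _ = auc_and_ks(y_te, rf.predict_proba(X_te[feats])[:, 1])
        if ks > best[1]:
            best = (feats, ks)
    return best   # e.g., the 15-feature set with the KS of 0.688 reported below
```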
We obtain the optimal feature set by comparing the KS statistics of the candidate feature sets produced by the optimal distance model solved with the Dijkstra-style algorithm, completing the feature selection. The best set contains 15 features; the variation of the KS value with the number of features is shown in Figure 5a. Its KS value of 0.688 is better than that of the other feature sets; the KS curve is shown in Figure 5b and the optimal feature set in Table 3.
Figure 5. a. Variation in KS of the optimal set for different numbers of features. b. KS curves under the optimal set.
The prediction model above is built on the 15 central features retained by the feature selection model. Although this reduces the complexity of the model, some of the information carried by the variables that are screened out is lost, which in turn can reduce the accuracy of the prediction model.
To further enhance model accuracy, this paper uses ensemble learning to make the most of the information embodied in the five non-central variables that were discarded. The basic idea is to build separate prediction models on the 15 selected central features and on the 5 discarded non-central features, and then weight the two sets of predictions to obtain the final result corrected by ensemble learning. This result therefore contains the information of the 5 variables that were initially discarded.
3.4. Bayesian network
A Bayesian network [27] is a directed acyclic graph consisting of nodes that represent features and directed edges that represent the interrelationships between nodes; each directed edge points from a parent node to a child node. Dependencies between features are expressed through conditional probabilities, and features without a parent node are described by their prior probabilities. Because Bayesian networks express the causal relationships between feature variables in a visual framework, they make the uncertain logic between variables clearer and easier to interpret.
In our work, a Bayesian network is constructed for the set of 15 selected central variables and another for the remaining 5 non-central variables. The sparsity of each network measures the redundancy of feature information within the corresponding feature set.
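As an illustration only (the paper does not name a toolkit), structure learning for each subset could be done with a score-based search such as pgmpy's HillClimbSearch; the call below assumes that API and discretized inputs.

```python
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore   # assumption: pgmpy's score-based API

def learn_structure(df: pd.DataFrame):
    """Learn a Bayesian-network DAG over a feature subset with score-based hill climbing.
    Sketch only; discretized inputs are expected."""
    search = HillClimbSearch(df)
    dag = search.estimate(scoring_method=BicScore(df))
    return list(dag.edges())

# Usage (hypothetical column lists): learn_structure(df[central_features])
#                                    learn_structure(df[non_central_features])
```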
As can be seen from Figure 6, after filtering by the feature selection algorithm based on the optimal correlation distance, the Bayesian network structure of each of the two feature subsets is simple, and the information redundancy between features has been effectively reduced.
Figure 6. a. Bayesian network of the 15 selected central features. b. Bayesian network of the 5 non-central features.
The aim of ensemble learning is to integrate the information described by the set of 15 central features with the set of 5 non-central variables through a certain weighting process, making full use of the information of the 5 non-central features that are screened out, so that the prediction model has a higher accuracy. The prediction model under ensemble learning can be expressed as:
$$ r_{\mathrm{learning}} = w_1 r_1 + w_2 r_2 \tag{8} $$
where r_learning is the prediction after ensemble learning, r_1 and r_2 are the predictions of the central and non-central variable sets, respectively, and w_1 and w_2 are the corresponding weights, which satisfy the following constraint:
$$ \begin{cases} w_1 > w_2 \\ w_1 + w_2 = 1 \end{cases} \tag{9} $$
The basic idea of the search is to find the pair of weights that maximizes the model AUC while satisfying the constraint in Eq (9). Searching in steps of 0.1, the optimal combination is w_1 = 0.7, w_2 = 0.3.
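A hedged sketch of this grid search, assuming `r1` and `r2` are the predicted probabilities of the two sub-models on a validation set (names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def search_ensemble_weights(y_true, r1, r2, step=0.1):
    """Grid-search w1 in steps of 0.1 under Eq (9) (w1 > w2, w1 + w2 = 1) and return
    the weights maximizing the AUC of the combined score in Eq (8)."""
    best_w1, best_auc = None, -np.inf
    for w1 in np.arange(0.6, 1.0 + 1e-9, step):       # w1 > w2 is equivalent to w1 > 0.5
        auc = roc_auc_score(y_true, w1 * r1 + (1 - w1) * r2)
        if auc > best_auc:
            best_w1, best_auc = w1, auc
    return best_w1, 1 - best_w1, best_auc

# The paper reports the searched optimum as w1 = 0.7, w2 = 0.3.
```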
4. Results and discussion
The common methods, t-test and chi-square test, used in the paper yielded 20 of the 25 characteristics with statistically significant effects. Some of these 20 features are significantly correlated, so this paper uses the feature selection method to obtain 15 features.
We use AUC values to quantify the improvement in prediction performance across the common methods (abbreviated CM), FS and EL. After several experiments with randomly assigned test and training sets, the model AUC is stable and no overfitting occurs, confirming that the random forest-based prediction model works well. Sensitivity-specificity and precision-recall curves are then plotted to calibrate the model. After FS, the AUC is 0.8826, an improvement of 0.052 over the AUC without FS. After ensemble learning, the AUC further increases to 0.896, an improvement of 0.091. The improvement in AUC for both steps is shown in Figure 7a, and the sensitivity-specificity and precision-recall curves are plotted in Figure 7b,c. This shows that the work in this paper is effective.
Figure 7. a. AUC of CM-RF, FS-RF and EL. b. Precision-recall curve of CM-RF, FS-RF and EL. c. Sensitivity-specificity curve of CM-RF, FS-RF and EL. d. AUC of our FS-RF compared to PCA and FA.
To show that the feature selection method in this paper is superior to other dimensionality reduction methods, we select principal component analysis (PCA) and factor analysis (FA) and compare the AUC values obtained with the random forest classifier. PCA recombines the original variables into a new set of uncorrelated composite variables, a few of which capture most of the information in the original variables; FA is a statistical technique for extracting common factors from a set of variables. Both are classical dimensionality reduction methods, but their effect is much lower than that of the feature selection in this paper, as shown in Figure 7d.
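For reference, such a comparison can be sketched with scikit-learn as below; the number of components and the forest settings are assumptions rather than the paper's exact configuration.

```python
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_with_reduction(X, y, reducer, n_components=15, seed=0):
    """Fit PCA or FA on the training split, then score a random forest on the test split,
    mirroring the comparison in Figure 7d (component count is an assumption)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    red = reducer(n_components=n_components).fit(X_tr)
    rf = RandomForestClassifier(n_estimators=500, random_state=seed)
    rf.fit(red.transform(X_tr), y_tr)
    return roc_auc_score(y_te, rf.predict_proba(red.transform(X_te))[:, 1])

# Usage: auc_with_reduction(X, y, PCA) and auc_with_reduction(X, y, FactorAnalysis)
```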
Moreover, we review the medical literature related to atherosclerosis: HDL levels correlate strongly with atherosclerosis [28], and increased IMT is an early clinical manifestation of the disease [29]. To further demonstrate the advantage of the prediction method used in this paper, we evaluate how well these single indicators, which have a demonstrated direct association with atherosclerosis, predict atherosclerotic outcomes. The ROC curves for HDL, for IMT and for the 25-feature set prior to feature selection are shown in Figure 8.
Figure 8. ROC curves for HDL and IMT used to predict atherosclerosis, and for the 25-feature set prior to feature selection.
The work in this paper builds on research on atherosclerosis and the guidance of professionals. The original dataset is preprocessed and analyzed through the relevant statistical analyses and tests, and we build an atherosclerosis prediction model based on a random forest classifier with good results. The comparison between the ROC curves for HDL/IMT and for the 25-feature set prior to feature selection shows that even a single, medically proven indicator strongly correlated with atherosclerosis falls far short of the predictive effect of our model, which further evidences the large role that feature redundancy plays in model prediction. We then transform the data screening problem into an optimization problem based on optimal paths and obtain an optimized set of 15 features by building an optimal distance model solved with the Dijkstra algorithm, which improves the model. Finally, the feature set is binned and discretized. The final model yields an AUC of 0.9170, an improvement of 0.0472 over the initial one. This illustrates that the optimal distance feature screening model proposed in this paper improves the atherosclerosis prediction model in terms of both prediction accuracy and the AUC metric.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (11771216), the Key Research and Development Program of Jiangsu Province (Social Development) (BE2019725), NUIST Students' Platform for Innovation and Entrepreneurship Training Program (XJDC202110300552), and the Undergraduate Innovation & Entrepreneurship Training Program of Jiangsu Province (202110300098Y).
Conflict of interest
All authors declare no conflicts of interest in this paper.
Table 2 (data): Features | Average value | Standard deviation | p-value
Sex | | | 0.056
Age | 55.7540 | 14.4806 | < 0.001
BMI | 23.8016 | 3.4114 | < 0.001
Triglycerides | 1.5360 | 1.4969 | < 0.001
Total cholesterol | 4.7253 | 1.9817 | 0.004
Glucose | 5.3104 | 1.3383 | < 0.001
Uric acid | 530.9794 | 195.3352 | 0.754
Haemoglobin | 130.9871 | 22.9056 | < 0.001
White blood cell count | 6.2310 | 3.0117 | 0.022
Red blood cell count | 4.5991 | 5.2696 | < 0.001
Platelet count | 195.4817 | 59.5907 | 0.477
Glutathione aminotransferase | 24.6704 | 19.1175 | 0.440
Glutathione transaminase | 22.7894 | 12.2302 | 0.896
HDL | 1.3621 | 0.3759 | < 0.001
LDL | 2.5349 | 0.7251 | < 0.001
Systolic blood pressure | 134.3746 | 21.3627 | < 0.001
Diastolic blood pressure | 78.2170 | 12.3026 | 0.003
LCCA-IMT | 0.0627 | 0.0563 | < 0.001
RCCA-IMT | 0.0584 | 0.0365 | < 0.001
LCCA-RI | 0.7167 | 0.0693 | 0.023
RCCA-RI | 0.7272 | 0.0893 | 0.017
LCCA-BS | 6.5665 | 2.7185 | < 0.001
LCCA-ES | 8.9234 | 2.4416 | < 0.001
RCCA-BS | 6.0710 | 1.4666 | < 0.001
RCCA-ES | 8.4066 | 2.2626 | < 0.001
Table 3 (data): Feature | %IncMSE | IncNodePurity
Age | 0.006542 | 3.688916
BMI | 0.00203 | 3.493642
Triglycerides | 0.000509 | 2.862369
Total cholesterol | 0.018734 | 6.030064
White blood cell count | 0.003258 | 3.490238
HDL | 0.045132 | 15.41302
LDL | 0.084491 | 20.03696
Systolic blood pressure | 0.00936 | 4.092086
Diastolic blood pressure | 0.006558 | 3.329568
LCCA-IMT | 0.014924 | 6.732724
RCCA-IMT | 0.041935 | 13.04469
LCCA-BS | 0.010674 | 4.803951
LCCA-ES | 0.00215 | 3.041142
RCCA-BS | 0.017512 | 7.513578
RCCA-ES | 0.007113 | 5.189474