
With the development of society and the quickening pace of modern life, depression and anxiety have become increasingly common. Depression, also known as major depressive disorder (MDD), is classified according to the diagnostic criteria for depressive disorders outlined in the Fifth Edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [1]. The severity of depression is categorized into mild, moderate, and severe depressive disorders. Mild depressive disorder is clinically characterized by symptoms such as feelings of sadness, loss of interest, fatigue, and difficulty concentrating; however, these symptoms do not significantly interfere with the individual's daily life. Moderate depressive disorder is associated with more severe symptoms, including profound sadness, feelings of helplessness, low self-esteem, sleep disturbances, and changes in appetite. Severe depressive disorder presents with more intense manifestations, such as hopelessness, suicidal ideation, significant changes in sleep and appetite, and an inability to concentrate or perform daily activities.
The 2021 Global Mental Health Insights Report [2] indicates that over 300 million individuals globally are affected by depression, with the number of depression cases increasing by approximately 18% over the past decade. This suggests that about one in five individuals globally will experience depression at some point in their lives, with a lifetime prevalence of 15%–18%. The suicide rate associated with depression is estimated to be between 4.0% and 10.6%. According to the 2022 China Depression Blue Book [3] published by People's Daily, the lifetime prevalence of depressive disorders among adults in China is 6.8%, with approximately 280,000 suicides occurring annually in the country, of which 40% are linked to depression. The burden of mental illness worldwide has become more severe following the COVID-19 pandemic, with an additional 53 million cases of depression reported globally, representing a 27.6% increase. Additionally, instances of severe depression and anxiety have risen by 28% and 26%, respectively. Timely and effective detection of depression is not only a crucial step in improving public health but also a key measure in reducing the global mental health burden and preventing suicidal behavior. Advancing research and application in depression detection can enable early intervention, enhance treatment success rates, and ultimately improve the quality of life and mental well-being for millions of individuals worldwide.
In recent years, significant progress has been made in prediction tasks based on graph convolutional networks (GCNs). Researchers have extended GCNs to the field of disease detection. EEG signals are typically acquired from multiple electrodes, and the spatial structure between these electrodes exhibits strong dependencies. By representing EEG signals as graph structures, where electrodes are treated as nodes and spatial relationships between electrodes as edges, graph neural networks (GNNs) can effectively capture both local and global dependencies between electrodes through graph convolution operations. This approach is particularly effective for processing spatial information in EEG signals and modeling the complex relationships between brain regions [4].
Research has shown that spectral features perform well in speech signal recognition tasks [5]. Time-frequency representations reveal how the frequency components of a signal vary over time, making them especially effective for non-stationary signals. The traditional Fourier transform provides global spectral information but ignores temporal dynamics, whereas the continuous wavelet transform allows multi-scale, multi-resolution time-frequency analysis. Smith et al. [6] used the Margenau-Hill transform to extract time-frequency domain features from EEG signals; it provides a localized representation of the signal in both time and frequency, making it better suited to non-stationary signals than the traditional Fourier transform, although in some cases the trade-off between time resolution and frequency resolution complicates the analysis. El-Sayed et al. [7] utilized recurrence plots to extract deep features from PPG signals, demonstrating that recurrence plots can visualize the recurrent states of a time series and help identify periodicities and patterns in the signals; however, recurrence plots are sensitive to noise and data quality, and noisy signals may produce misleading patterns. Siuly et al. [8] employed a wavelet scattering transform (WST) to capture time-frequency features of EEG signals, showing the superiority of time-frequency representations in capturing the characteristics of EEG signals. To further explore the advantages of time-frequency representations in physiological signal feature extraction, Smith et al. [9] compared several time-frequency methods, including the short-time Fourier transform, continuous wavelet transform, Zhao-Atlas-Marks distribution, and smoothed pseudo Wigner-Ville distribution (SPWVD).
Compared with more complex time-frequency analysis methods, such as wavelet transforms or the Wigner-Ville distribution, the short-time Fourier transform (STFT) divides the signal into short windows and performs a Fourier transform within each window, thus preserving both time and frequency information without neglecting the time dimension. The STFT has lower computational cost and provides sufficient signal features while balancing time and frequency resolution. Therefore, we employ the STFT to extract spectral features from speech signals, convert the frequency (y-axis) to a logarithmic scale, and use the color dimension to generate spectrograms. Subsequently, the y-axis is mapped to the mel scale to generate a mel spectrogram.
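As a concrete illustration, the STFT-to-log-spectrogram step can be sketched in a few lines of numpy. This is a minimal sketch: the window length, hop size, and sampling rate below are illustrative choices, not values taken from this paper, and the final mel-scale mapping (typically done with a mel filterbank) is omitted.

```python
import numpy as np

def stft_log_spectrogram(signal, win_len=400, hop=160):
    """Frame the signal, apply a Hann window to each frame, take the
    per-frame power spectrum, and return a log-scaled spectrogram."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum per frame
    return 10.0 * np.log10(power + 1e-10)             # log (dB-like) scale

# Toy usage: a 1-second 440 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
S = stft_log_spectrogram(x)
print(S.shape)  # (frames, frequency bins) = (98, 201)
```

In practice, the subsequent mel-scale mapping is usually performed with a standard audio library rather than by hand; the sketch only shows the windowing and log-scaling logic described above.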
Vision transformer (ViT), a deep learning architecture based on the self-attention mechanism, has a powerful ability to model global dependencies. Traditional convolutional neural networks (CNNs) typically rely on local receptive fields to extract features, whereas a ViT leverages self-attention to model long-range dependencies between different positions, which is particularly important for time-frequency representations of EEG and speech signals. Time-frequency representations of EEG and speech signals often contain intricate interwoven frequency bands and complex time-frequency features. A ViT can effectively capture these complex spatiotemporal relationships, extracting deep features that are useful for tasks such as classification [10].
Based on this, the present study aims to combine GCNs and transformer models for multimodal depression detection based on EEG and speech signals. The main contributions of this study are as follows:
(1) Proposing a multimodal depression detection model (MHA-GCN_ViT): This study combines EEG and speech signals, utilizing GCNs and vision transformers to effectively extract and fuse the spatiotemporal and time-frequency features of EEG signals and the time-frequency features of speech signals, thereby improving the accuracy of multimodal depression detection.
(2) Feature extraction using a discrete wavelet transform (DWT) and short-time Fourier transform (STFT): The study uses the DWT to extract discrete wavelet features from the EEG and construct a brain network structure, while the STFT is employed to extract time-frequency features from both EEG and speech signals, including mel spectrogram features.
(3) Introducing multi-head attention to enhance the brain network representation of the GCN: This model incorporates multi-head attention with GCNs to capture complex relationships between different EEG channels, thereby enhancing the GCN's ability to represent brain networks.
(4) Achieving significant performance improvement: The model was validated through five-fold cross-validation on the MODMA dataset. Experimental results demonstrate that the model achieves high accuracy, precision, recall, and F1 score, showing a significant improvement in depression detection performance. This confirms the model's effectiveness and potential for application in other multimodal detection tasks related to psychological and neurological disorders.
Conventional approaches to diagnosing depression primarily rely on subjective evaluations. Clinicians engage in observation, active listening, and inquiry with patients, integrating these insights with standardized assessment scales to formulate a comprehensive diagnosis. With the advancement of technology, researchers can now diagnose depression using biological information, magnetic resonance imaging (MRI), and physiological signals [11]. Current research on depression detection mainly focuses on using single modalities such as an EEG, speech, and text, as well as multimodal approaches that combine social media text, interview speech, and video. This paper will focus on multimodal depression detection by combining EEG and speech signals, summarizing the related research work.
Research has shown that there are distinct differences in electroencephalogram (EEG) signals between individuals with depression and healthy controls. For example, patients with depression exhibit different EEG signal characteristics within specific frequency bands [12]; there are also differences in the connectivity patterns of EEG signals between depressed individuals and healthy controls [13]. Additionally, the EEG responses of individuals with depression to stimuli or tasks differ, often showing weaker or slower responses compared to healthy individuals [14]. These responses are related to emotion regulation, cognitive control, and attention. However, EEG signals present certain challenges, such as temporal asymmetry, instability, low signal-to-noise ratio, and uncertainty regarding the specific brain regions involved in particular responses. In the review by Khare et al. [15], the methods for detecting mental disorders such as depression, autism, and obsessive-compulsive disorder using physiological signals are systematically discussed. A framework for the automatic detection of mental and developmental disorders using physiological signals is proposed. The review also explores the advantages of signal analysis, feature engineering, and decision-making, along with future development directions and challenges in this field. Therefore, depression detection based on EEG signals remains a challenging task.
The brain can be regarded as a complex network, where different brain regions are connected by neural fibers, forming an extensive interactive system. This network structure can be modeled as a graph, in which nodes represent EEG channels and edges represent the connections between these channels. The graph structure is capable of capturing the intricate connectivity patterns between brain regions, which is beneficial for extracting the spatial features of EEG signals. Yang et al. [16] extracted nonlinear features, such as Lempel-Ziv complexity (LZC), and frequency-domain power spectral density (PSD) features from EEG signals, analyzing the EEG during resting states with eyes closed and eyes open. They validated the effectiveness of multiple brain regions in detecting depression, identifying the temporal region as the most effective, with an accuracy of 87.4%. Considering the organizational structure of brain functional networks, Yao et al. [17] proposed using sparse group Lasso (sgLasso) to improve the construction of brain functional hyper-networks. They performed feature fusion and classification using multi-kernel learning on two sets of features with significant differences, selected through feature selection, achieving an accuracy of 87.88% after multi-feature fusion. Yang et al. [18] introduced a graph neural network-based method for depression recognition that utilizes data augmentation and model ensemble strategies. The method leverages graph neural networks to learn the features of brain networks and employs a model ensemble strategy to obtain predictions through majority voting on deep features. Experimental results demonstrated that graph neural networks possess strong learning capabilities for brain networks.
Chen et al. [19] proposed a GNN-based multimodal fusion strategy for depression detection, exploring the heterogeneity and homogeneity among various physiological and psychological modalities and investigating potential relationships between subjects. Zhang et al. [20] developed a model based on graph convolutional networks (GCNs) with sub-attentional segmentation and an attention mechanism (SSPA-GCN). The model incorporates domain generalization through adversarial training, and experimental results showed that GCNs effectively capture the spatial features of EEG signals. Wu et al. [21] introduced a spatial-temporal graph convolutional network (ST-GCN) model for depression detection, creating an adjacency matrix for EEG signals using the phase-locking value (PLV). The ST-GCN, constructed with spatial convolution blocks and standard temporal convolution blocks, improved the capacity to learn spatial-temporal features. Experimental results indicated that the ST-GCN, combined with depression-related brain functional connectivity maps, holds potential for clinical diagnosis. The attention mechanism provides an effective means to dynamically focus on the critical parts of the input, capture long-range dependencies, and enhance model interpretability, thereby improving performance. Qin et al. [22] proposed a probabilistic sparse self-attention neural network (PSANet) framework for EEG-based depression diagnosis, integrating the EEG with patients' physiological parameters for multidimensional diagnosis. The experimental results demonstrated that fusing physiological signals with signals from other dimensions achieved high classification accuracy. Jiang et al. [23] proposed a novel multi-graph learning neural network (MGLNN), which learns the graph structure best suited to GNN learning from multiple candidate graph structures and demonstrates strong classification performance in multi-graph semi-supervised tasks.
While depression detection based on EEG signals remains challenging, the application of graph structures and multimodal fusion plays a significant role in enhancing detection performance. Furthermore, depression detection based on speech signals is also of great importance.
Individuals with depression exhibit abnormalities in behavioral signals, such as speech signals, compared to healthy individuals. Depression patients display certain vocal characteristics, including alterations in pitch, tone, speech rate, and volume, such as low, muffled, and weak voice quality [24]. The degree of speech clarity and fuzziness is also associated with depression, and analyzing and processing speech signals can help extract features relevant to depression. Kim et al. [25] employed a CNN model to analyze the mel-spectrograms of speech signals, learning the acoustic characteristics of individuals with depression. Their results indicated that deep learning methods outperformed traditional learning approaches, achieving a maximum accuracy of 78.14%. Yang et al. [26] proposed a joint learning framework based on speech signals, called the depression-aware learning framework (DALF), which includes the depression filter bank learning (DFBL) module and the multi-scale spectral attention (MSSA) module. On the DAIC-WOZ dataset, their approach achieved an F1 score of 78.4%, offering a promising new method for depression detection. Yin et al. [27] introduced the transformer-CNN-CNN (TCC) model for depression detection based on speech signals, utilizing parallel CNN modules to focus on local knowledge, while parallel transformer modules with linear attention mechanisms captured temporal sequence information. Experimental results on the DAIC-WOZ and MODMA datasets demonstrated that TCC performed well with relatively low computational complexity. However, depression detection based on a single modality still has certain limitations, such as insufficient robustness in specific contexts.
Several researchers have adopted multimodal approaches to enhance the performance of depression diagnosis. Generally, multimodal fusion methods are categorized into early fusion, intermediate fusion, and late fusion [28]. Early fusion, also known as feature-level fusion, typically involves concatenating features from multiple modalities before feeding them into a predictive model. Late fusion, also referred to as decision-level fusion, merges information from different modalities at the decision stage. Decision-level fusion preserves the decision results of each modality, avoiding the feature information loss or blurring that may occur in feature-level fusion; it also considers the weight and importance of each modality, thereby fully integrating information from different modalities and improving the model's understanding of multimodal data. In the field of image modality fusion, Liu et al. [29] proposed a novel adversarial learning-based multimodal fusion method for MR images. This method utilizes a segmentation network as the discriminator, enhancing the correlation of tumor pathological information by fusing contrast-enhanced T1-weighted and fluid attenuated inversion recovery (FLAIR) MRI modalities. Zhu et al. [30] introduced a brain tumor segmentation approach based on the fusion of deep semantic and edge information, using the Swin transformer to extract semantic features and designing a CNN-based edge detection module; their MFIB (multi-feature information blending) fusion method combines semantic and edge features, suggesting applications for multimodal fusion in disease detection. To further improve segmentation accuracy, Zhu et al. [31] proposed an end-to-end three-dimensional brain tumor segmentation model comprising a modality information extraction (MIE) module, a spatial information enhancement (SIE) module, and a boundary shape correction (BSC) module; the output is fed into a deep convolutional neural network (DCNN) for learning, significantly improving segmentation accuracy. In recent studies, Liu et al. [32] proposed a statistical method to validate the effectiveness of objective metrics in multi-focus image fusion and introduced a CNN-based fusion measure that quantifies the similarity between source and fused images using semantic features across multiple layers, offering a new approach to image-based multi-feature fusion. Bucur et al. [33] proposed a time-based multimodal transformer architecture that utilizes pre-trained models to extract image and text embeddings for detecting depression from social media posts. Roy et al. [34] proposed an improved version of the YOLOv4 algorithm, integrating DenseNet to optimize feature propagation and reuse; the modified path aggregation network (PANet) further enhances the fusion of multi-scale local and global feature information, providing an effective method for multi-feature fusion. To further improve fusion performance, Roy et al. [35] introduced a DenseNet and Swin-transformer-based YOLOv5 model (DenseSPH-YOLOv5), which enhances feature extraction and fusion by combining DenseNet with a Swin transformer and incorporates a convolutional block attention module (CBAM) and a Swin transformer prediction head (SPH), significantly improving detection accuracy and efficiency in complex environments.
Jamil et al. [36] proposed an efficient and robust phonocardiogram (PCG) signal classification framework based on a vision transformer (ViT). The framework extracts MFCC and LPCC features from 1D PCG signals, as well as various deep convolutional neural network (D-CNN) features from 2D PCG signals; feature selection is performed using nature-/biologically inspired algorithms (NIA/BIA), while a ViT applies self-attention to the time-frequency representation (TFR) of the 2D PCG signals. Experimental results demonstrate the effectiveness of the ViT in PCG signal classification. Fan et al. [37] introduced a transformer-based multimodal depression detection framework (TMFE) that integrates video, audio, and rPPG signals. This framework employs CNNs to extract video and audio features, uses an end-to-end framework to extract rPPG signal values, and finally feeds them into MLP layers for depression detection. The results showed that multimodal depression detection outperformed unimodal approaches, and that combining physiological with behavioral signals offers significant advantages. Ning et al. [38] proposed a depression detection framework that integrates linear and nonlinear features of EEG and speech signals, achieving an accuracy of 86.11% for recognizing depression patients and 87.44% for healthy controls on the MODMA dataset. Abdul et al. [39] presented an end-to-end multimodal depression detection model based on speech and EEG modalities, which uses a 1D CNN-LSTM to capture the temporal information of the EEG and combines 2D time-frequency features of EEG signals with 2D mel spectrogram features of speech signals, feeding them into a vision transformer for depression detection. The experimental results demonstrated that the vision transformer effectively learned the spectral features of both EEG and speech signals.
Inspired by the above studies, this paper proposes a multimodal depression diagnosis method that combines physiological signals and speech signals, utilizing a graph convolutional network to model the relationships between brain channels and capture deep spatial and spectral features of EEG signals. The method also introduces decision-level fusion of multiple EEG features and speech spectral features, which more comprehensively integrates deep depression-related information from EEG and speech signals, significantly improving depression detection performance.
This section presents a multimodal model for the detection of depression, which integrates electroencephalogram (EEG) and speech signals. The model employs a multi-head attention graph convolutional network (MHA-GCN) and vision transformer (ViT) to extract deep spatiotemporal and spectral features from EEG signals, while utilizing the ViT model to extract deep spectral features from speech signals. The proposed model framework, illustrated in Figure 1, consists of three main components: the input layer, the feature extraction layer, and the decision fusion and classification layer.
Due to the complex time-frequency characteristics of EEG signals, which include varying frequency components and temporal waveform changes, traditional frequency-domain or time-domain analysis methods are insufficient for capturing comprehensive signal features. The discrete wavelet transform (DWT) allows for multi-scale decomposition of the signal, effectively capturing the feature information of EEG signals across different time scales and frequencies. By analyzing and processing the coefficients obtained from DWT decomposition, a more thorough understanding and description of the time-frequency characteristics of EEG signals can be achieved.
In this paper, after preprocessing the EEG signals (with details provided in Section 3.2.1), the DWT is used to extract wavelet features from each EEG channel as node features, which are then input into the GCN module. The formula for the wavelet transform is:
$$ WT(\alpha,\tau)=\frac{1}{\sqrt{\alpha}}\int_{-\infty}^{\infty} f(t)\,\psi^{*}\!\left(\frac{t-\tau}{\alpha}\right)dt, \tag{3.1} $$
In Eq (3.1), $\alpha$ represents the scale and $\tau$ represents the translation. The scale is inversely proportional to the frequency, while the translation corresponds to time. The scale controls the stretching or compression of the wavelet function, and the translation controls the shifting of the wavelet function. A window function is introduced as follows:
$$ \psi_{a,b}(t)=\frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right), \tag{3.2} $$
Based on this, the formula for the continuous wavelet transform (CWT) is given by:
$$ W_{\psi}f(a,b)=\frac{1}{\sqrt{a}}\int_{-\infty}^{\infty} f(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right)dt, \tag{3.3} $$
In Eq (3.3), $a$ represents the scale and $b$ represents the time shift. By restricting these two variables in the wavelet basis function to discrete points, the formula for the discrete wavelet transform (DWT) is obtained:
$$ W_{\psi}f(j,k)=\int_{-\infty}^{\infty} f(t)\,\psi_{j,k}^{*}(t)\,dt, \tag{3.4} $$
The discrete wavelet features of each channel are then stored as the node features of the MHA-GCN. The DWT process is illustrated in Figure 2.
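To make the decomposition concrete, the sketch below implements a multi-level DWT in numpy using the Haar wavelet and summarizes each channel by its sub-band energies. The Haar basis and the energy statistics are illustrative assumptions; the paper does not state which mother wavelet or which coefficient statistics it uses for the node features.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: approximation and detail coefficients."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # approximation (low-pass)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail (high-pass)
    return a, d

def wavelet_node_features(eeg, levels=3):
    """Per-channel feature vector: detail-coefficient energy at each level
    plus the final approximation energy. The sample count per channel is
    assumed to be divisible by 2**levels."""
    feats = []
    for ch in eeg:                        # eeg: (channels, samples)
        a, f = ch, []
        for _ in range(levels):
            a, d = haar_dwt(a)
            f.append(np.sum(d ** 2))      # detail energy per level
        f.append(np.sum(a ** 2))          # final approximation energy
        feats.append(f)
    return np.array(feats)                # (channels, levels + 1)

X = wavelet_node_features(np.random.randn(128, 1024), levels=3)
print(X.shape)  # (128, 4)
```

Because the Haar basis is orthonormal, the sub-band energies of each channel sum to the channel's total energy, which makes the features easy to sanity-check.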
Given that the spatial relationships between EEG channels are crucial for understanding brain function and pattern recognition, graph convolutional networks (GCNs) are capable of propagating information among neighboring nodes, capturing local spatial characteristics while considering global contextual relationships. Based on this, this paper constructs a graph convolutional network where each EEG channel is treated as a node. The Pearson correlation coefficient (PCC) between the features of each channel is calculated, and the adjacency matrix is obtained from the PCC matrix of the channels. The feature matrix, constructed with the DWT features extracted from each channel as node features, is then fed into the constructed graph convolutional network. The network, combined with the multi-head attention mechanism, extracts the deep correlation features between EEG channels.
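A minimal numpy sketch of the adjacency construction described above follows. Taking the absolute value of the correlations and applying a threshold are illustrative choices; the paper does not specify how negative or weak correlations are handled.

```python
import numpy as np

def pcc_adjacency(X, threshold=0.0):
    """Adjacency matrix from channel-wise Pearson correlation coefficients.
    X: (channels, features) node-feature matrix."""
    A = np.corrcoef(X)            # (channels, channels) PCC matrix
    A = np.abs(A)                 # use correlation magnitude as edge weight
    np.fill_diagonal(A, 0.0)      # no self-loops here (added later as A + I)
    A[A < threshold] = 0.0        # optionally drop weak links
    return A

A = pcc_adjacency(np.random.randn(128, 16))
print(A.shape)  # (128, 128)
```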
In a graph convolutional network (GCN), a graph is defined as $G=(V,E,A)$, where $V$ represents the set of nodes, $E$ the set of edges, and $A$ the adjacency matrix of $G$. $D$ is a diagonal matrix with $D_{ii}=\sum_j A_{ij}$, representing the degree of node $v_i$. If there is an edge between nodes $i$ and $j$, $A(i,j)$ denotes the weight of the edge; otherwise, $A(i,j)=0$. For an unweighted graph, $A(i,j)$ is typically set to 1 if an edge exists and 0 otherwise.
For the node set $V$, $H^{(l)}$ represents the feature matrix of all nodes at layer $l$, and $H^{(l+1)}$ the feature matrix after one graph convolution operation. The formula for a single graph convolution operation is given by:
H(l+1)=σ(˜D−12˜A˜D−12H(l)W(l)),˜A=A+I, | (3.5) |
In Eq (3.5), $I$ denotes the identity matrix. $\tilde{D}$ is the degree matrix of $\tilde{A}$, computed as $\tilde{D}_{ii}=\sum_j \tilde{A}_{ij}$. $\sigma$ denotes a nonlinear activation function, such as ReLU, and $W^{(l)}$ is the trainable parameter matrix of the graph convolution at the current layer.
The GCN constructed in this paper comprises two layers of GCNLayer, with each layer's forward propagation including a linear layer and a convolution layer. The input data consists of the adjacency matrix and node features. Figure 3 illustrates the MHA-GCN model with four attention heads.
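A minimal NumPy sketch of the graph convolution of Eq (3.5), with random matrices standing in for the trained parameters:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution (Eq 3.5): H' = ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])            # A~ = A + I (adds self-loops)
    d = A_tilde.sum(axis=1)                     # degrees of A~ (always >= 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetrically normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU activation

rng = np.random.default_rng(2)
A = (rng.random((29, 29)) > 0.7).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)  # symmetric, no self-loops
H = rng.standard_normal((29, 16))               # node feature matrix
W = rng.standard_normal((16, 8))                # trainable weights (random here)
H1 = gcn_layer(A, H, W)
print(H1.shape)  # (29, 8)
```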
To enhance the GCN's representation learning capability for the brain channel network graph and to improve its ability to express relationships between nodes, a multi-head attention mechanism module is introduced. Given the query $q\in\mathbb{R}^{d_q}$, key $k\in\mathbb{R}^{d_k}$, and value $v\in\mathbb{R}^{d_v}$, each attention head $h_i\ (i=1,\ldots,h)$ is computed as follows:

$$h_i = f\!\left(W_i^{(q)} q,\; W_i^{(k)} k,\; W_i^{(v)} v\right) \in \mathbb{R}^{p_v} \qquad (3.6)$$

In Eq (3.6), the learnable parameters are $W_i^{(q)}\in\mathbb{R}^{p_q\times d_q}$, $W_i^{(k)}\in\mathbb{R}^{p_k\times d_k}$, and $W_i^{(v)}\in\mathbb{R}^{p_v\times d_v}$. The attention-pooling function $f$ can be either additive attention or scaled dot-product attention.
The output of the multi-head attention mechanism undergoes another linear transformation, applied to the concatenated results of the $h$ heads. The corresponding learnable parameter is $W_o\in\mathbb{R}^{p_o\times h p_v}$, giving the output $W_o\,[h_1;\ldots;h_h]\in\mathbb{R}^{p_o}$. This allows each head to focus on different parts of the input. The multi-head attention mechanism is illustrated in Figure 4. In the proposed model, the multi-head attention mechanism is combined with the graph convolutional network (GCN). The node feature matrix $X$ output by the second GCN layer is used as input and multiplied by the weight matrices $W_q$, $W_k$, and $W_v$ obtained during training to derive the query $q$, key $k$, and value $v$ for the multi-head attention mechanism. The output represents the deep features of each node.
First, node $i$ and the nodes $j$ in its first-order neighborhood are linearly transformed:

$$q_{c,i} = W_{c,q} h_i + b_{c,q}, \qquad k_{c,j} = W_{c,k} h_j + b_{c,k} \qquad (3.7)$$
In Eq (3.7), $q_{c,i}$ represents the transformed feature vector of the central node $i$, $k_{c,j}$ the transformed feature vector of a neighboring node $j$, and $W_{c,q}$, $W_{c,k}$, $b_{c,q}$, and $b_{c,k}$ the learnable weights and biases; $c$ indexes the attention heads, of which 4 are used in this study.
Next, the multi-head attention coefficients between the central node and its neighboring node j are calculated using scaled dot-product attention, as described in the following equation:
$$\alpha_{c,ij} = \frac{[q_{c,i},\, k_{c,j}]}{\sum_{u \in \mathcal{N}(i)} [q_{c,i},\, k_{c,u}]} \qquad (3.8)$$

In Eq (3.8), $\alpha_{c,ij}$ represents the attention coefficient of head $c$ between nodes $i$ and $j$, $[q,k]=\exp\!\left(\frac{q^T k}{\sqrt{d}}\right)$, and $d$ denotes the dimensionality of the node's hidden layer.
After obtaining the multi-head attention coefficients between each node and its neighboring nodes, we apply a linear transformation to the feature vectors of the neighboring nodes: $v_{c,j}=W_{c,v}h_j+b_{c,v}$, where $v_{c,j}$ denotes the transformed feature vector of node $j$, and $W_{c,v}$ and $b_{c,v}$ represent the learnable weights and biases.
We then multiply the transformed feature vectors of the neighboring nodes by the corresponding multi-head attention coefficients and take the average to obtain the importance score of each node, as shown in the following equation:
$$Z = \frac{1}{C} \sum_{c=1}^{C} \left( \sum_{j \in \mathcal{N}(i)} \alpha_{c,ij}\, v_{c,j} \right) \qquad (3.9)$$

In Eq (3.9), $Z=[z_1, z_2, \ldots, z_n]\in\mathbb{R}^{n\times 1}$, $z_i$ represents the importance score of node $i$, and $C$ denotes the number of attention heads used in the attention mechanism.
By applying attention weighting to the feature matrix output by the GCN, the expressive power of the features can be enhanced, enabling the model to better learn the complex patterns and structures inherent in the graph data.
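The attention pipeline of Eqs (3.7)-(3.9) can be sketched end to end as follows; random matrices stand in for the trained parameters, and the biases of Eq (3.7) are omitted for brevity:

```python
import numpy as np

def graph_multihead_scores(H, A, heads=4, d_hid=8, seed=3):
    """Per-node importance scores via Eqs (3.7)-(3.9): linear maps to q/k/v,
    scaled dot-product attention over first-order neighbours, averaged over
    the C heads. Weights are random stand-ins for trained parameters."""
    rng = np.random.default_rng(seed)
    n, f = H.shape
    Z = np.zeros(n)
    for _ in range(heads):
        Wq = rng.standard_normal((f, d_hid))
        Wk = rng.standard_normal((f, d_hid))
        Wv = rng.standard_normal((f, 1))
        q, k, v = H @ Wq, H @ Wk, H @ Wv              # Eq (3.7), biases omitted
        for i in range(n):
            nbrs = np.nonzero(A[i])[0]                # first-order neighbourhood
            if nbrs.size == 0:
                continue
            logits = (q[i] @ k[nbrs].T) / np.sqrt(d_hid)
            alpha = np.exp(logits - logits.max())     # Eq (3.8): softmax weights
            alpha /= alpha.sum()
            Z[i] += alpha @ v[nbrs, 0]                # Eq (3.9), inner sum
    return Z / heads                                  # average over C heads

rng = np.random.default_rng(4)
A = (rng.random((29, 29)) > 0.6).astype(float); np.fill_diagonal(A, 0)
Z = graph_multihead_scores(rng.standard_normal((29, 16)), A)
print(Z.shape)  # (29,)
```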
In order to thoroughly analyze the time-frequency domain characteristics of electroencephalogram (EEG) and speech signals, these signals are transformed into two-dimensional (2D) EEG time-frequency spectrograms and 2D mel-spectrograms of speech using a short-time Fourier transform (STFT). The 2D EEG time-frequency spectrogram combines the time and frequency dimensions to display the frequency domain characteristics and temporal dynamics of EEG signals. This approach facilitates the investigation of brain frequency activity patterns, inter-regional brain coordination, and the dynamic changes in frequency components, which are crucial for understanding the spectral properties of EEG signals and conducting EEG signal analysis. The mel-spectrogram reflects the frequency distribution of speech signals, encompassing frequency components, pitch characteristics, resonance features, and acoustic traits, which highlight the acoustic differences between speech modalities in depressed patients and healthy subjects. The vision transformer effectively integrates global dependencies within the signals. Moreover, by applying the self-attention mechanism, the vision transformer assigns varying weights to signals at different time points, thereby addressing the non-stationary nature inherent in EEG and speech signals. Inspired by the work of Abdul et al. [39], this paper employs the vision transformer's positional encoding module and multi-head self-attention module to extract deep frequency domain features from EEG and speech signals. These features are then fused to classify depression by combining the multi-features of EEG signals with the frequency domain features of speech signals.
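A minimal example of the STFT step for one EEG channel using SciPy; the window length and overlap are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import stft

fs = 250                                   # EEG sampling rate (Hz)
rng = np.random.default_rng(5)
eeg = rng.standard_normal(fs * 4)          # 4 s of one (dummy) EEG channel

# 2D time-frequency spectrogram: magnitude of the STFT over a sliding window
f, t, Zxx = stft(eeg, fs=fs, nperseg=128, noverlap=64)
spectrogram = np.abs(Zxx)                  # rows: frequency bins, cols: frames
print(spectrogram.shape)                   # first dim = nperseg // 2 + 1 = 65
```

The resulting 2D array is what gets rendered as the time-frequency image fed to the ViT.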
The structure of the vision transformer (ViT) model is illustrated in Figure 5. It includes a linear layer, a positional encoding module, and a multi-head self-attention module. The model is configured as vit_base_patch16_224 [40], which defines its basic input size: 2D spectrograms with a resolution of 224×224 pixels are divided into 16×16 patches, which serve as the elements of the model's input sequence. In this study, the inputs to the ViT model are the 2D EEG time-frequency spectrograms and speech mel-spectrograms, both in PNG format. The queries $q$, keys $k$, and values $v$ of the ViT are obtained by multiplying the input sequence with the weight matrices $W_q$, $W_k$, and $W_v$ learned during training.
Traditional single-source or simple fusion methods may not meet the demands of complex tasks. To improve the accuracy and robustness of decision-making, decision-level weighted fusion assigns different weights to each information source, prioritizing decision cues with higher reliability or stronger relevance, and is thus better able to handle uncertainty and bias in the information. In this study, the outputs of the three classifiers based on EEG and speech signals are assumed to be normalized to $[0,1]$ and denoted as $m=[m_1,m_2]$, $n=[n_1,n_2]$, and $v=[v_1,v_2]$, where $m_i$, $n_i$, and $v_i$ represent the predicted probabilities of the normal and depressed states, respectively.

The validation accuracies of the EEG and speech models on the validation set are denoted as $\varepsilon=[\varepsilon_1,\varepsilon_2,\varepsilon_3]$. The weighted sum of the probabilities is then computed as $p_i=\varepsilon_1 m_i+\varepsilon_2 n_i+\varepsilon_3 v_i,\ i=1,2$, and the final predicted label $c$ is determined as follows:
$$c = \arg\max_i(p_i) \qquad (3.10)$$
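A small numerical sketch of the decision-level weighted fusion of Eq (3.10); the probabilities and validation accuracies below are illustrative values only:

```python
import numpy as np

# Normalized class probabilities from the three classifiers
# (index 0: normal, index 1: depressed) -- illustrative values.
m = np.array([0.30, 0.70])   # EEG MHA-GCN branch
n = np.array([0.60, 0.40])   # EEG ViT branch
v = np.array([0.20, 0.80])   # speech ViT branch

# Validation accuracies used as fusion weights (hypothetical values)
eps = np.array([0.66, 0.74, 0.62])

p = eps[0] * m + eps[1] * n + eps[2] * v     # weighted sum of probabilities
c = int(np.argmax(p))                        # Eq (3.10): final predicted label
print(p, c)                                  # here c = 1 (depressed)
```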
This study utilized the MODMA dataset [41], which is a multimodal depression dataset collected by the Key Laboratory of Wearable Computing, Gansu Province, Lanzhou University. The dataset primarily includes electroencephalogram (EEG) and speech data from clinical depression patients and matched healthy controls. The EEG data was collected using a traditional 128-channel electrode cap, involving 53 participants, including 24 outpatients with depression (13 males and 11 females; ages 16–56) and 29 healthy controls (20 males and 9 females; ages 18–55). The recordings include EEG signals in resting states and under stimulation. For the audio data collection, the dataset comprises 52 participants, including 23 outpatients with depression (16 males and 7 females; ages 16–56) and 29 healthy controls (20 males and 9 females; ages 18–55), with audio data recorded during interviews, reading tasks, and picture description tasks.
In this study, the preprocessing of EEG data was conducted utilizing the EEGLAB software. Initially, 29 channels were selected for each participant, namely [F7, F3, Fz, F4, F8, T7, C3, Cz, C4, T8, P7, P3, Pz, P4, P8, O1, O2, Fpz, F9, FT9, FT10, TP9, TP10, PO9, PO10, Iz, A1, A2, POz], with the EEG signal sampling frequency set to 250 Hz. All EEG signals underwent high-pass filtering with a cutoff frequency of 1 Hz and low-pass filtering with a cutoff frequency of 40 Hz. This filtering process preserved the EEG signal information while minimizing baseline drift and high-frequency electromyographic interference. Independent component analysis (ICA) was employed to eliminate artifacts associated with eye movements. Subsequently, discrete wavelet transform (DWT) was used to extract features from each channel, which were used as node features for constructing the brain channel network. Furthermore, a short-time Fourier transform (STFT) was applied to the artifact-free EEG data to extract two-dimensional time-frequency maps for the 29 channels, as illustrated in Figure 6.
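The 1-40 Hz filtering step can be sketched with SciPy as below; the paper performs it in EEGLAB, so the Butterworth filter type and its order here are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250.0                                  # sampling rate (Hz)
# 1-40 Hz band-pass, matching the high-/low-pass cutoffs in the text;
# filter family and order are assumptions (EEGLAB defaults to FIR filters).
b, a = butter(4, [1.0, 40.0], btype="bandpass", fs=fs)

rng = np.random.default_rng(6)
raw = rng.standard_normal((29, int(fs) * 10))        # 29 channels, 10 s
filtered = filtfilt(b, a, raw, axis=1)               # zero-phase filtering
print(filtered.shape)  # (29, 2500)
```

Zero-phase filtering (`filtfilt`) avoids shifting the EEG waveforms in time, which matters for subsequent time-frequency analysis.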
Speech signals are temporal signals, and the preprocessing steps for speech signals include sampling and quantization, framing, windowing, and the short-time Fourier transform (STFT). The raw speech signal undergoes pre-emphasis, windowing, and framing. Pre-emphasis is a process that enhances the high-frequency components of the signal at the beginning of the transmission line to compensate for the excessive attenuation of high-frequency components during transmission. The mathematical representation of the pre-emphasis process is as follows:
$$H(z) = 1 - \mu z^{-1} \qquad (4.1)$$
where μ represents the pre-emphasis coefficient, typically set to 0.97, consistent with the parameter settings in reference [42].
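In the time domain, Eq (4.1) corresponds to $y[n] = x[n] - \mu\, x[n-1]$, which can be implemented directly:

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """Eq (4.1) in the time domain: y[n] = x[n] - mu * x[n-1]."""
    return np.append(x[0], x[1:] - mu * x[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])
y = pre_emphasis(x)
print(y)  # [1.   0.03 0.03 0.03] -- flat (low-frequency) content is suppressed
```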
Speech signals are time-varying, but they are generally assumed to be stationary over short periods, referred to as frames, typically 10 to 30 milliseconds long. In this study, the frame length is 25 milliseconds, and framing is achieved by sliding a finite-length window over the signal. A Hamming window is used as the window function, and the windowed speech signal is obtained by multiplying the window function with the signal:

$$S_w(n) = s(n)\, w(n) \qquad (4.2)$$

In Eq (4.2), $w(n)$ represents the Hamming window of length $N$:

$$w(n) = \begin{cases} 0.54 - 0.46\cos\!\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\[4pt] 0, & \text{otherwise} \end{cases} \qquad (4.3)$$
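Framing and Hamming windowing (Eqs (4.2) and (4.3)) can be sketched as follows; the 10 ms hop is an assumption, as the paper specifies only the 25 ms frame length:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping Hamming-windowed frames (Eqs 4.2-4.3).
    The 10 ms hop is an assumed value; the paper states only 25 ms frames."""
    N = int(fs * frame_ms / 1000)                 # frame length in samples
    hop = int(fs * hop_ms / 1000)
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming, Eq (4.3)
    starts = range(0, len(x) - N + 1, hop)
    return np.stack([x[s:s + N] * w for s in starts])   # Eq (4.2) per frame

fs = 16000
x = np.ones(fs)                                   # 1 s dummy signal
frames = frame_signal(x, fs)
print(frames.shape)  # (98, 400): 400 samples = 25 ms at 16 kHz
```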
Subsequently, each frame is subjected to the short-time Fourier transform (STFT), which facilitates the transformation of the time-domain signal into a frequency-domain representation. The energy spectrum of each frame is then computed by squaring the magnitude of each frequency spectrum point to obtain the power spectrum of the speech signal. A mel filter bank is designed to filter the power spectrum, with the frequency response of the filter defined as:
$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\[4pt] \dfrac{2\,(k - f(m-1))}{(f(m+1)-f(m-1))(f(m)-f(m-1))}, & f(m-1) \le k \le f(m) \\[4pt] \dfrac{2\,(f(m+1)-k)}{(f(m+1)-f(m-1))(f(m+1)-f(m))}, & f(m) \le k \le f(m+1) \\[4pt] 0, & k > f(m+1) \end{cases} \qquad (4.4)$$

where $f(m)$ represents the center frequency of the $m$-th mel filter. Finally, the mel spectrogram of the speech signal is obtained, as shown in Figure 7.
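A self-contained construction of the triangular mel filter bank of Eq (4.4); for simplicity this sketch omits the area-normalization factor, and the filter count and FFT size are illustrative values:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, fmin=0.0, fmax=None):
    """Triangular mel filters H_m(k) of Eq (4.4), without the area
    normalization factor."""
    fmax = fmax or fs / 2
    # n_filters + 2 points equally spaced on the mel scale: f(0) .. f(M+1)
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    # centre frequencies f(m) mapped to FFT bin indices
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            H[m - 1, k] = (k - lo) / max(mid - lo, 1)      # rising slope
        for k in range(mid, hi):
            H[m - 1, k] = (hi - k) / max(hi - mid, 1)      # falling slope
    return H

H = mel_filterbank(n_filters=40, n_fft=512, fs=16000)
print(H.shape)  # (40, 257)
```

Multiplying the per-frame power spectrum by `H.T` yields the mel-band energies that form the mel spectrogram.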
The experiments in this study were conducted on an NVIDIA GeForce RTX 3060 GPU with 16 GB of RAM, using Python 3.9.0 and PyTorch 1.11.0. To validate the effectiveness and generalization capability of the model, five-fold cross-validation was employed: the dataset was randomly divided into five subsets, with one subset used as the test set and the remaining four as the training set in each iteration. The optimal parameters obtained from training are summarized in Table 1.
Parameters | Value |
Learning rate | 0.001 |
Batch size | 4
Epochs | 400
Loss function | CrossEntropyLoss |
Optimizer | Adam |
The original EEG signal has a dimension of 128×T, where 128 is the number of channels and T the length of the signal. After applying the discrete wavelet transform (DWT), the output has a dimension of 29×15×180, indicating 29 channels, 15 DWT features, and a time step of 180. In parallel, the EEG signal undergoes a short-time Fourier transform (STFT) to extract spectral features, yielding a time-frequency representation of dimension 1×512. The speech signal, after STFT and mel filtering, is transformed into a mel spectrogram of dimension 3×224×224. The model structure parameters are shown in Table 2.
Module | Network layer | Inputs | Outputs | Activation function | Parameters memory (MB)
MHA-GCN | GCNLayer1 | (29, 15*180, 180) | (29, 15*180, 128) | ReLU | 5.578
 | GCNLayer2 | (29, 15*180, 128) | (29, 128, 512) | - |
 | multihead_attn | (29, 128, 512) | (1, 512) | Sigmoid |
EEG_ViT | vit_base_patch16_224 | (3, 224, 224) | (1, 512) | Sigmoid | 226.773
Speech_ViT | vit_base_patch16_224 | (3, 224, 224) | (1, 512) | Sigmoid | 328.682
Decision-level weighted fusion | Weighted Sum and FC | (1, 512) × 3 | (1, 2) | Sigmoid | 2.64
Total | | | | | 563.673
To assess the performance and effectiveness of the proposed model, the evaluation metrics used in this study include accuracy, precision, recall, and F1 score. The calculation formulas are as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (5.1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5.2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (5.3)$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5.4)$$
In the formulas, TP (True Positive) denotes the number of true positive instances, TN (True Negative) denotes the number of true negative instances, FP (False Positive) denotes the number of false positive instances, and FN (False Negative) denotes the number of false negative instances.
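The four metrics of Eqs (5.1)-(5.4) computed directly from raw confusion-matrix counts (the counts below are illustrative, not taken from the experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Metrics of Eqs (5.1)-(5.4) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq (5.1)
    precision = tp / (tp + fp)                          # Eq (5.2)
    recall = tp / (tp + fn)                             # Eq (5.3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq (5.4)
    return accuracy, precision, recall, f1

# Illustrative counts: 45 TP, 44 TN, 5 FP, 6 FN over 100 samples
acc, prec, rec, f1 = classification_metrics(45, 44, 5, 6)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```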
To validate the effectiveness of the MHA-GCN and decision-level fusion, three models were designed for comparison:
(1) MHA-GCN + Feature Fusion: This model employs feature-level fusion in the fusion stage while keeping the other components unchanged.
(2) 1DCNN-LSTM + Decision Fusion: This model uses a 1DCNN-LSTM to extract deep features from the EEG discrete wavelet features, with the remaining components unchanged.
(3) MHA-GCN + Decision Fusion: This is the proposed model, denoted as MHA-GCN_ViT.
The performance of depression detection was compared across these models using the MODMA dataset, and the experimental results are presented in Table 3.
Model | MHA-GCN | Feature fusion | Decision fusion | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
MHA-GCN+Feature Fusion | ✓ | ✓ | - | 75.42 | 75.33 | 75.41 | 75.32 |
1DCNN-LSTM+Decision Fusion | - | - | ✓ | 78.07 | 78.08 | 78.07 | 77.90
MHA-GCN+Decision Fusion | ✓ | - | ✓ | 89.03 | 90.16 | 89.04 | 88.83 |
A comparative analysis of models 1 and 3 reveals that the decision-level fusion achieved an accuracy of 89.03%, precision of 90.16%, recall of 89.04%, and an F1 score of 88.83%. These metrics represent improvements of 13.61%, 14.83%, 13.63%, and 13.51%, respectively, over feature-level fusion. This enhancement is attributed to the decision-level fusion's integration of different modalities at the decision stage, leveraging the complementarity and richness of multimodal information. It enhances the model's understanding and comprehensive judgment capabilities by reducing feature redundancy and avoiding repetition or conflicts between different modalities. This refinement leads to more precise and effective feature learning, improving the model's generalization ability, robustness, and reliability, thereby demonstrating the effectiveness and superiority of multimodal decision-level fusion.
In the comparison between models 2 and 3, the use of the MHA-GCN resulted in improvements of 10.96%, 12.08%, 10.97%, and 10.93% in accuracy, precision, recall, and F1 score, respectively, relative to the 1DCNN-LSTM model. This improvement is due to MHA-GCN's capability to effectively capture the global spatial features of EEG signals. The MHA-GCN offers greater flexibility and efficiency in modeling and feature learning for graph data, learning more representative feature representations, effectively reducing feature space dimensions, and enhancing the model's abstraction and expression of EEG signals. This, in turn, improves performance in classification tasks. The training and accuracy curves of the models are shown in Figure 8.
To validate the effectiveness of multimodal fusion, we designed ablation experiments utilizing four distinct models: the single-modal EEG_MHA-GCN, the single-modal EEG_ViT, the single-modal Audio_ViT, and the multimodal MHA-GCN_ViT.
(1) EEG_MHA-GCN: This model performs depression classification using single-modal EEG signals. After preprocessing, discrete wavelet features are extracted and combined with the MHA-GCN model.
(2) EEG_ViT: This model uses single-modal EEG signals. Time-frequency features are extracted from EEG signals and processed by the vision transformer for depression classification.
(3) Audio_ViT: This model utilizes single-modal audio signals. Mel-spectrogram features from the audio signals are processed by the vision transformer for depression classification.
(4) MHA-GCN_ViT: This multimodal model integrates both EEG and audio signals through decision-level fusion for depression classification. The experimental results are presented in Table 4.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
EEG_MHA-GCN | 66.45 | 78.56 | 66.46 | 68.55 |
EEG_ViT | 74.05 | 73.84 | 75.28 | 67.38 |
Audio_ViT | 62.30 | 60.83 | 62.57 | 63.30 |
MHA-GCN_ViT | 89.03 | 90.16 | 89.04 | 88.83 |
The results of the multimodal ablation experiments demonstrate that the MHA-GCN_ViT model achieved an accuracy of 89.03%, precision of 90.16%, recall of 89.04%, and F1 score of 88.83%. This performance can be attributed to the comprehensive integration of features from multiple modalities, which provides richer and more comprehensive information, enhancing both the data representation and the model's understanding capabilities. The effectiveness of multimodal fusion is thus confirmed. Furthermore, the MHA-GCN model, which combines multi-head attention and graph convolutional networks, effectively learns and represents the spatial features of EEG channels, capturing the inter-channel relationships and significance of EEG signals. This approach allows the model to better comprehend the structural characteristics of EEG signals, improving feature representation and the model's understanding capabilities. Decision-level fusion of EEG and audio signals at the decision layer leverages the complementarity and richness of multimodal information, enhancing the model's overall understanding and judgment capabilities. By integrating cross-modal information, the model can more comprehensively consider relationships between different modalities, thus improving the accuracy and reliability of depression detection.
To evaluate the advanced nature of the proposed model, comparative experiments were conducted on the MODMA dataset with other models, and all results are sourced from the original papers.
MS2-GNN Model [19]: This model constructs intra-modal and inter-modal graph neural networks for EEG and speech signals after passing them through LSTM networks. The features are then fused and classified using attention mechanisms.
HD_ES Model [39]: This model extracts features from EEG signals using 1DCNN-LSTM and a vision transformer, and features from speech signals using a vision transformer. These features are then fused and classified.
Fully Connected Model [43]: Features from EEG and speech signals are extracted separately and then fused. Classification is performed using a deep neural network (DNN).
HD_ES [39]* denotes HD_ES re-run in the experimental environment of this study; its results were obtained on the same training and test datasets as ours.
MultiEEG-GPT Model [44]: This model preprocesses EEG signals through filtering and constructs a topology graph. The EEG features, along with features extracted from speech signals (e.g., MFCCs, mel-spectrogram, chroma STFT), are fed into the GPT-4 API for feature learning and classification. MultiEEG-GPT-1 refers to the classification results under zero-shot prompting, while MultiEEG-GPT-2 refers to the results under few-shot prompting, as shown in Table 5.

The results in Table 5 indicate that the proposed model improves on the MS2-GNN model on the MODMA dataset, with increases of 2.54% in accuracy, 7.81% in precision, 1.54% in recall, and 3.98% in F1 score. Compared to the HD_ES* model, the proposed model shows improvements of 2.36% in accuracy, 2.95% in precision, 2.36% in recall, and 0.94% in F1 score. Compared to the Fully_Connected model, it achieves a notable enhancement of 12.72% in accuracy. These improvements are attributed to the proposed model's more comprehensive and effective feature representations, which better capture the inherent information in the data, yielding superior expressiveness and adaptability and thereby enhancing performance and generalization. Consequently, the proposed model outperforms the baseline models in the experiments, confirming its effectiveness.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
MS2-GNN [19] | 86.49 | 82.35 | 87.50 | 84.85 |
HD_ES [39] | 97.31 | 97.71 | 97.34 | 97.30 |
HD_ES [39]* | 86.67 | 87.21 | 86.68 | 87.89 |
Fully_connected Model [43] | 76.31 | - | - | - |
MultiEEG-GPT-1 [44] | 73.54±2.03 | - | - | - |
MultiEEG-GPT-2 [44] | 79.00±1.59 | - | - | - |
Ours | 89.03 | 90.16 | 89.04 | 88.83 |
This paper proposes a multimodal depression detection model based on EEG and speech signals, utilizing a graph-based approach to model EEG channels and construct a brain channel network structure that captures the spatial features of EEG signals. To enhance depression detection performance, the model integrates speech signals for comprehensive assessment. Through comparative experiments and multimodal ablation studies, the effectiveness of the MHA-GCN and of decision-level fusion has been demonstrated, and the model's performance metrics outperform those of the baseline models and other advanced models.
Model advantages: The strength of our model lies in the MHA-GCN_ViT, which effectively extracts and integrates the time-frequency and spatiotemporal features of EEG signals, as well as the time-frequency features of speech signals. This enhances the model's ability to leverage information from different data sources. Additionally, the use of the GCN and vision transformer enables the model to capture the complex structures and dependencies within EEG and speech signals. Furthermore, MHA-GCN_ViT demonstrates strong performance and robustness in the task of depression detection.
Model limitations: Current issues include assessing whether the model is suitable for different degrees and types of depression, as well as understanding how individual differences such as age, gender, and cultural background might impact model performance. The MHA-GCN_ViT combines complex deep learning architectures such as the GCN and ViT, resulting in high computational complexity. In resource-constrained scenarios, such as on devices with limited computing power, the model may face significant computational burdens, potentially affecting real-time performance. Additionally, the model's generalization performance in other datasets or real clinical environments needs further validation to ensure its stability and reliability.
Future research directions: Based on the research by Khare et al. [45], we have identified that model interpretability and uncertainty quantification are of significant importance for emotion recognition, as well as for its application areas such as depression detection. Therefore, our future research direction will focus on investigating the individual differences that contribute to model variability, enhancing the model's generalization and robustness across different domains, such as anxiety, schizophrenia, and other diseases, as well as across various datasets. We also aim to improve the model's interpretability and uncertainty quantification. Furthermore, Singh et al. [46] developed a "Tinku" robot, which employs advanced deep learning models to assist in training children with autism, demonstrating the promising potential of these models in the field of disease treatment. Additionally, we will explore the application of our model in real-world clinical settings, developing corresponding tools or platforms to promote the practical application and dissemination of the model in depression detection and treatment [47].
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work is supported by the Doctoral Research Start-up Fund of Shaanxi University of Science and Technology (2020BJ-30) and the National Natural Science Foundation of China (61806118).
The authors declare there is no conflict of interest.
![]() |
[40] |
Deng W, Wang J, Zhang R (2022) Measures of concordance and testing of independence in multivariate structure. J Multivariate Anal 191: 105035. https://doi.org/10.1016/j.jmva.2022.105035 doi: 10.1016/j.jmva.2022.105035
![]() |
[41] |
Dwumfour RA (2017) Explaining banking stability in Sub-Saharan Africa. Res Int Bus Financ 41: 260–279. https://doi.org/10.1016/j.ribaf.2017.04.027 doi: 10.1016/j.ribaf.2017.04.027
![]() |
[42] |
Entrop O, Memmel C, Ruprecht B, et al. (2015) Determinants of bank interest margins: Impact of maturity transformation. J Bank Financ 54: 1–19. https://doi.org/10.1016/j.jbankfin.2014.12.001 doi: 10.1016/j.jbankfin.2014.12.001
![]() |
[43] |
Fernández-Méndez C, González VM (2019) Bank ownership, lending relationships and capital structure: Evidence from Spain. BRQ Bus Res Quart 22: 137–154. https://doi.org/10.1016/j.brq.2018.05.002 doi: 10.1016/j.brq.2018.05.002
![]() |
[44] |
Fiordelisi F, Marques-Ibanez D, Molyneux P (2011) Efficiency and risk in European banking. J Bank Financ 35: 1315–1326. https://doi.org/10.1016/j.jbankfin.2010.10.005 doi: 10.1016/j.jbankfin.2010.10.005
![]() |
[45] |
Fleming G, Heaney R, McCosker R (2005) Agency costs and ownership structure in Australia. Pac-Basin Financ J 13: 29–52. https://doi.org/10.1016/j.pacfin.2004.04.001 doi: 10.1016/j.pacfin.2004.04.001
![]() |
[46] |
Fu XM, Lin YR, Molyneux P (2014) Bank competition and financial stability in Asia Pacific. J Bank Financ 38: 64–77. https://doi.org/10.1016/j.jbankfin.2013.09.012 doi: 10.1016/j.jbankfin.2013.09.012
![]() |
[47] |
Fungáčová Z, Poghosyan T (2011) Determinants of bank interest margins in Russia: Does bank ownership matter? Econ Syst 35: 481–495. https://doi.org/10.1016/j.ecosys.2010.11.007 doi: 10.1016/j.ecosys.2010.11.007
![]() |
[48] |
Garel A, Petit-Romec A, Vennet RV (2022) Institutional Shareholders and Bank Capital. J Financ Intermed 50: 100960. https://doi.org/10.1016/j.jfi.2022.100960 doi: 10.1016/j.jfi.2022.100960
![]() |
[49] |
Gertler M, Kiyotaki N, Queralto A (2012) Financial crises, bank risk exposure and government financial policy. J Monetary Econ 59: S17–S34. https://doi.org/10.1016/j.jmoneco.2012.11.007 doi: 10.1016/j.jmoneco.2012.11.007
![]() |
[50] |
Gharaibeh AMO (2023) The Determinants of Capital Adequacy in the Jordanian Banking Sector: An Autoregressive Distributed Lag-Bound Testing Approach. Int J Financ Stud 11: 75. https://doi.org/https://doi.org/10.3390/ijfs11020075 doi: 10.3390/ijfs11020075
![]() |
[51] | Gitman LJ, Zutter CJ (2015) Principles of Managerial Finance (6th ed.). Pearson Higher Education AU. |
[52] |
Gorton G, Winton A (1998) Banking in Transition Economies: Does Efficiency Require Instability? J Money Credit Bank 30: 621–650. https://doi.org/10.2307/2601261 doi: 10.2307/2601261
![]() |
[53] |
Granger CW (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica J Econometric Soc 37: 424–438. https://doi.org/10.2307/1912791 doi: 10.2307/1912791
![]() |
[54] Gupta AD, Sarker N, Rahman MR (2021) Relationship among cost of financial intermediation, risk, and efficiency: Empirical evidence from Bangladeshi commercial banks. Cogent Econ Financ 9: 1967575. https://doi.org/10.1080/23322039.2021.1967575
[55] Gupta AD, Yesmin A (2022) Effect of risk and market competition on efficiency of commercial banks: Does ownership matter? J Bus Econ Financ 11: 22–42. https://doi.org/10.17261/Pressacademia.2022.1550
[56] Gupta J, Kashiramka S, Ly KC, et al. (2023) The interrelationship between bank capital and liquidity creation: A non-linear perspective from the Asia-Pacific region. Int Rev Econ Financ 85: 793–820. https://doi.org/10.1016/j.iref.2023.02.017
[57] Gupta N, Mahakud J, McMillan D (2020) Ownership, bank size, capitalization and bank performance: Evidence from India. Cogent Econ Financ 8. https://doi.org/10.1080/23322039.2020.1808282
[58] Gupta PK, Sharma S (2023) Role of corporate governance in asset quality of banks: comparison between government-owned and private banks. Manag Financ 49: 724–740. https://doi.org/10.1108/MF-04-2022-0165
[59] Hanzlík P, Teplý P (2020) Key factors of the net interest margin of European and US banks in a low interest rate environment. Int J Financ Econ 27: 2795–2818. https://doi.org/10.1002/ijfe.2299
[60] Haris M, Tan Y, Malik A, et al. (2020) A study on the impact of capitalization on the profitability of banks in emerging markets: A case of Pakistan. J Risk Financ Manag 13: 217. https://doi.org/10.3390/jrfm13090217
[61] Ho TSY, Saunders A (1981) The determinants of bank interest margins: theory and empirical evidence. J Financ Quant Anal 16: 581–600. https://doi.org/10.2307/2330377
[62] Huang FW, Chen S, Tsai JY (2019) Optimal bank interest margin under capital regulation: bank as a liquidity provider. J Financ Econ Policy 11: 158–173. https://doi.org/10.1108/JFEP-12-2017-0124
[63] Hussain M, Bashir U (2020) Risk-competition nexus: Evidence from Chinese banking industry. Asia Pac Manag Rev 25: 23–37. https://doi.org/10.1016/j.apmrv.2019.06.001
[64] Iannotta G, Nocera G, Sironi A (2013) The impact of government ownership on bank risk. J Financ Intermed 22: 152–176. https://doi.org/10.1016/j.jfi.2012.11.002
[65] Islam MS, Nishiyama SI (2016) The determinants of bank net interest margins: A panel evidence from South Asian countries. Res Int Bus Financ 37: 501–514. https://doi.org/10.1016/j.ribaf.2016.01.024
[66] Jensen MC, Meckling WH (1976) Theory of the firm: Managerial behaviour, agency costs and ownership structure. J Financ Econ 3: 305–360. https://doi.org/10.1016/0304-405X(76)90026-X
[67] Jiang C, Liu H, Molyneux P (2019) Do different forms of government ownership matter for bank capital behavior? Evidence from China. J Financ Stabil 40: 38–49. https://doi.org/10.1016/j.jfs.2018.11.005
[68] Jones JD (1989) A comparison of lag–length selection techniques in tests of Granger causality between money growth and inflation: evidence for the US, 1959–86. Appl Econ 21: 809–822. https://doi.org/10.1080/758520275
[69] Kanapiyanova K, Faizulayev A, Ruzanov R, et al. (2023) Does social and governmental responsibility matter for financial stability and bank profitability? Evidence from commercial and Islamic banks. J Islamic Account Bus Res 14: 451–472. https://doi.org/10.1108/JIABR-01-2022-0004
[70] Kasman S, Kasman A (2015) Bank competition, concentration and financial stability in the Turkish banking industry. Econ Syst 39: 502–517. https://doi.org/10.1016/j.ecosys.2014.12.003
[71] Khan MS, Senhadji AS (2001) Threshold effects in the relationship between inflation and growth. IMF Staff Papers 48: 1–21. https://doi.org/10.2307/4621658
[72] Kimeldorf G, May JH, Sampson AR (1982) Concordant and discordant monotone correlations and their evaluation by nonlinear optimization. Stud Manag Sci 19: 117–130. Available from: https://apps.dtic.mil/sti/pdfs/ADA093816.pdf.
[73] La Porta R, Lopez-de-Silanes F, Shleifer A (2002) Government ownership of banks. J Financ 57: 265–301. https://doi.org/10.1111/1540-6261.00422
[74] Lazopoulos I (2013) Liquidity uncertainty and intermediation. J Bank Financ 37: 403–414. https://doi.org/10.1016/j.jbankfin.2012.09.026
[75] Lepetit L, Saghi-Zedek N, Tarazi A (2015) Excess control rights, bank capital structure adjustments, and lending. J Financ Econ 115: 574–591. https://doi.org/10.1016/j.jfineco.2014.10.004
[76] Levine R (1997) Financial development and economic growth: Views and agenda. J Econ Lit 35: 688–726. Available from: https://www.jstor.org/stable/2729790.
[77] Liu Y, Brahma S, Boateng A (2020) Impact of ownership structure and ownership concentration on credit risk of Chinese commercial banks. Int J Manag Financ 16: 253–272. https://doi.org/10.1108/IJMF-03-2019-0094
[78] Marinkovic S, Radovic O (2010) On the determinants of interest margin in transition banking: the case of Serbia. Manag Financ 36: 1028–1042. https://doi.org/10.1108/03074351011088432
[79] Mateev M, Nasr T (2023) Banking system stability in the MENA region: the impact of market power and capital requirements on banks' risk-taking behavior. Int J Islamic Middle Eastern Financ Manag 16: 1107–1140. https://doi.org/10.1108/IMEFM-05-2022-0198
[80] Mateev M, Tariq MU, Sahyouni A (2021) Competition, capital growth and risk-taking in emerging markets: Policy implications for banking sector stability during COVID-19 pandemic. PLoS One 16: e0253803. https://doi.org/10.1371/journal.pone.0253803
[81] Maudos J, Fernández de Guevara J (2004) Factors explaining the interest margin in the banking sectors of the European Union. J Bank Financ 28: 2259–2281. https://doi.org/10.1016/j.jbankfin.2003.09.004
[82] McShane RW, Sharpe IG (1985) A time series/cross section analysis of the determinants of Australian trading bank loan/deposit interest margins: 1962–1981. J Bank Financ 9: 115–136. https://doi.org/10.1016/0378-4266(85)90065-2
[83] Mehrotra P, Vyas V, Naik PK (2023) Behaviour of capital and risk under Basel regulations: A simultaneous equations model study of Indian commercial banks. Glob Bus Rev 1–19. https://doi.org/10.1177/09721509221146059
[84] Mehzabin S, Shahriar A, Hoque MN, et al. (2023) The effect of capital structure, operating efficiency and non-interest income on bank profitability: new evidence from Asia. Asian J Econ Bank 7: 25–44. https://doi.org/10.1108/AJEB-03-2022-0036
[85] Mekonnen Y (2015) Determinants of capital adequacy of Ethiopia commercial banks. Eur Sci J 11: 315–331. Available from: https://eujournal.org/index.php/esj/article/view/6222.
[86] Mia MDR (2023) Market competition, capital regulation and cost of financial intermediation: an empirical study on the banking sector of Bangladesh. Asian J Econ Bank 7: 251–276. https://doi.org/10.1108/AJEB-03-2022-0028
[87] Modigliani F, Miller MH (1958) The cost of capital, corporation finance and the theory of investment. Am Econ Rev 48: 261–297. Available from: https://www.jstor.org/stable/1809766.
[88] Moudud-Ul-Huq S (2021) Does bank competition matter for performance and risk-taking? Empirical evidence from BRICS countries. Int J Emerg Mark 16: 409–447. https://doi.org/10.1108/IJOEM-03-2019-0197
[89] Moudud-Ul-Huq S, Ahmed K, Chowdhury MAF, et al. (2022) How do banks' capital regulation and risk-taking respond to COVID-19? Empirical insights of ownership structure. Int J Islamic Middle Eastern Financ Manag 15: 406–424. https://doi.org/10.1108/IMEFM-07-2020-0372
[90] Moussa MAB (2018) Determinants of bank capital: Case of Tunisia. J Appl Financ Bank 8: 1–15. Available from: http://www.scienpress.com/Upload/JAFB/Vol%208_2_1.pdf.
[91] Mujtaba G, Akhtar Y, Ashfaq S, et al. (2021) The nexus between Basel capital requirements, risk-taking and profitability: what about emerging economies? Econ Res-Ekon Istraž 1–22. https://doi.org/10.1080/1331677X.2021.1890177
[92] Nguyen TPT, Nghiem SH (2015) The interrelationships among default risk, capital ratio and efficiency. Manag Financ 41: 507–525. https://doi.org/10.1108/MF-12-2013-0354
[93] Nickell S (1981) Biases in dynamic models with fixed effects. Econometrica 49: 1417–1426. https://doi.org/10.2307/1911408
[94] Otero L, Razia A, Cunill OM, et al. (2020) What determines efficiency in MENA banks? J Bus Res 112: 331–341. https://doi.org/10.1016/j.jbusres.2019.11.002
[95] Ozili PK, Uadiale O (2017) Ownership concentration and bank profitability. Futur Bus J 3: 159–171. https://doi.org/10.1016/j.fbj.2017.07.001
[96] Peia O, Vranceanu R (2018) The cost of capital in a model of financial intermediation with coordination frictions. Oxford Econ Pap 70: 266–285. https://doi.org/10.1093/oep/gpx037
[97] Poghosyan T (2010) Re-examining the impact of foreign bank participation on interest margins in emerging markets. Emerg Mark Rev 11: 390–403. https://doi.org/10.1016/j.ememar.2010.08.003
[98] Pushner GM (1995) Equity ownership structure, leverage, and productivity: Empirical evidence from Japan. Pac-Basin Financ J 3: 241–255. https://doi.org/10.1016/0927-538X(95)00003-4
[99] Puspitasari E, Sudiyatno B, Hartoto WE, et al. (2021) Net interest margin and return on assets: A case study in Indonesia. J Asian Financ Econ Bus 8: 727–734. https://doi.org/10.13106/jafeb.2021.vol8.no4.0727
[100] Raharjo PG, Hakim DB, Manurung AH, et al. (2014) Determinant of capital ratio: A panel data analysis on state-owned banks in Indonesia. Bull Monetary Econ Bank 16: 395–414. https://doi.org/10.21098/bemp.v16i4.19
[101] Rahman M, Ashraf B, Zheng C, et al. (2017) Impact of cost efficiency on bank capital and the cost of financial intermediation: Evidence from BRICS countries. Int J Financ Stud 5: 32. https://doi.org/10.3390/ijfs5040032
[102] Rahman MM, Rahman M, Masud MAK (2023) Determinants of the cost of financial intermediation: Evidence from emerging economies. Int J Financ Stud 11. https://doi.org/10.3390/ijfs11010011
[103] Rahman MM, Zheng C, Ashraf BN, et al. (2018) Capital requirements, the cost of financial intermediation and bank risk-taking: Empirical evidence from Bangladesh. Res Int Bus Financ 44: 488–503. https://doi.org/10.1016/j.ribaf.2017.07.119
[104] Rakshit B, Bardhan S (2019) Bank competition and its determinants: Evidence from Indian banking. Int J Econ Bus 26: 283–313. https://doi.org/10.1080/13571516.2019.1592995
[105] Rastogi S, Gupte R, Meenakshi R (2021) A holistic perspective on bank performance using regulation, profitability, and risk-taking with a view on ownership concentration. J Risk Financ Manag 14: 111. https://doi.org/10.3390/jrfm14030111
[106] Roodman D (2009) How to do xtabond2: An introduction to difference and system GMM in Stata. Stata J 9: 86–136. https://doi.org/10.1177/1536867X0900900106
[107] Saeed M, Izzeldin M, Hassan MK, et al. (2020) The inter-temporal relationship between risk, capital and efficiency: The case of Islamic and conventional banks. Pac-Basin Financ J 62: 101328. https://doi.org/10.1016/j.pacfin.2020.101328
[108] Saif-Alyousfi AY, Saha A (2021) Determinants of banks' risk-taking behavior, stability and profitability: Evidence from GCC countries. Int J Islamic Middle Eastern Financ Manag 14: 874–907. https://doi.org/10.1108/IMEFM-03-2019-0129
[109] Sapienza P (2004) The effects of government ownership on bank lending. J Financ Econ 72: 357–384. https://doi.org/10.1016/j.jfineco.2002.10.002
[110] Saunders A, Wilson B (2001) An analysis of bank charter value and its risk-constraining incentives. J Financ Serv Res 19: 185–195. https://doi.org/10.1023/A:1011163522271
[111] Sensarma R, Ghosh S (2004) Net interest margin: does ownership matter? Vikalpa 29: 41–48. https://doi.org/10.1177/0256090920040104
[112] Shabir M, Jiang P, Wang W, et al. (2023) COVID-19 pandemic impact on banking sector: A cross-country analysis. J Multinatl Financ Manag 67: 100784. https://doi.org/10.1016/j.mulfin.2023.100784
[113] Shawtari FA, Ariff M, Razak SHA (2019) Efficiency and bank margins: a comparative analysis of Islamic and conventional banks in Yemen. J Islamic Account Bus Res 10: 50–72. https://doi.org/10.1108/JIABR-07-2015-0033
[114] Shleifer A, Vishny RW (1997) A survey of corporate governance. J Financ 52: 737–783. https://doi.org/10.1111/j.1540-6261.1997.tb04820.x
[115] Soedarmono W, Tarazi A (2013) Bank opacity, intermediation cost and globalization: Evidence from a sample of publicly traded banks in Asia. J Asian Econ 29: 91–100. https://doi.org/10.1016/j.asieco.2013.09.003
[116] Tabak BM, Fazio DM, Cajueiro DO (2012) The relationship between banking market competition and risk-taking: Do size and capitalization matter? J Bank Financ 36: 3366–3381. https://doi.org/10.1016/j.jbankfin.2012.07.022
[117] Thompson CG, Kim RS, Aloe AM, et al. (2017) Extracting the variance inflation factor and other multicollinearity diagnostics from typical regression results. Basic Appl Soc Psych 39: 81–90. https://doi.org/10.1080/01973533.2016.1277529
[118] Toumi K (2019) Islamic ethics, capital structure and profitability of banks; what makes Islamic banks different? Int J Islamic Middle Eastern Financ Manag 13: 116–134. https://doi.org/10.1108/IMEFM-05-2016-0061
[119] Trinugroho I, Agusman A, Tarazi A (2014) Why have bank interest margins been so high in Indonesia since the 1997/1998 financial crisis? Res Int Bus Financ 32: 139–158. https://doi.org/10.1016/j.ribaf.2014.04.001
[120] Vo XV (2018) Bank lending behavior in emerging markets. Financ Res Lett 27: 129–134. https://doi.org/10.1016/j.frl.2018.02.011
[121] Yu HC (2000) Banks' capital structure and the liquid asset–policy implication of Taiwan. Pac Econ Rev 5: 109–114. https://doi.org/10.1111/1468-0106.00093
[122] Zheng C, Chowdhury MM, Khan MAM, et al. (2023) Effects of ownership on the relationship between bank capital and financial performance: evidence from Bangladesh. Int J Res Bus Soc Sci 12: 260–274. https://doi.org/10.20525/ijrbs.v12i9.2787
[123] Zheng C, Gupta AD, Moudud-Ul-Huq S (2018) Do human capital and cost efficiency affect risk and capital of commercial banks? An empirical study of a developing country. Asian Econ Financ Rev 8: 22–37. https://doi.org/10.18488/journal.aefr.2018.81.22.37