
Accidental falls pose a significant threat to the elderly population, and accurate fall detection from surveillance videos can significantly reduce the negative impact of falls. Although most deep-learning-based video fall detection algorithms train on and detect either the human posture or the key points in images or videos, we have found that a pose-based model and a key-point-based model can complement each other to improve fall detection accuracy. In this paper, we propose a preposed attention capture mechanism applied to the images that will be fed into the training network, together with a fall detection model based on this mechanism. We accomplish this by fusing dynamic human key point information with the original human posture image. We first propose the concept of dynamic key points to account for the incomplete pose key point information in the fall state. We then introduce an attention expectation that precedes the original attention mechanism of the deep model by automatically labeling dynamic key points. Finally, the deep model trained with dynamic human key points is used to correct the detection errors of the deep model trained with raw human pose images. Our experiments on the Fall Detection Dataset and the UP-Fall Detection Dataset demonstrate that the proposed fall detection algorithm effectively improves detection accuracy and provides better support for elderly care.
Citation: Kun Zheng, Bin Li, Yu Li, Peng Chang, Guangmin Sun, Hui Li, Junjie Zhang. Fall detection based on dynamic key points incorporating preposed attention[J]. Mathematical Biosciences and Engineering, 2023, 20(6): 11238-11259. doi: 10.3934/mbe.2023498
Retinal tears arise from vitreous traction on the retina or from retinal degeneration and atrophy, and they are frequently observed in individuals with acute posterior vitreous detachment [1]. The identification of retinal tears, which are a risk factor for retinal detachment, poses a significant challenge. In the absence of timely detection and intervention, 30–50% of cases will progress to retinal detachment [2], a condition that leads to severe blinding. In most cases, retinal tears can be diagnosed by using indirect fundoscopy in conjunction with scleral pressure examination [3]. However, when the patient's refractive media are opaque, B-scan ultrasound emerges as a viable option among the limited alternative diagnostic tools available. Moreover, ultrasound is more accessible and less expensive than other imaging modalities such as OCT and ultra-wide-field imaging, and it is widely available in many local hospitals and primary community clinics. However, conventional manual methods require the involvement of highly skilled physicians to prevent oversight or misdiagnosis [4]. In this context, only a few large hospitals in China have professional sonographers, as is the case in other developing countries and regions. As a result, the development of a model capable of automatically diagnosing retinal tears is critical and urgent [5].
Deep learning represents the most effective approach to automating the development of diagnostic systems. Previous studies have proposed a multitude of models, with a predominant focus on convolutional neural networks (CNNs) [6,7]. For example, Li et al. [8] screened for notable peripheral retinal lesions (NPRLs) by using numerous models, such as InceptionResNetV2, InceptionV3, ResNet50 and VGG16. Furthermore, with an accuracy of 79.8%, a system based on seResNet50 was developed by Zhang et al. [9] to screen numerous types of NPRLs. However, the inability of CNNs to capture long-distance image features hinders their continued development. In this context, Dosovitskiy et al. [10] proposed the vision transformer (ViT) as a solution to this problem, drawing on the successful transformer [11] from natural language processing. Subsequently, ViT was observed to outperform CNNs on a multitude of benchmarks after self-attention mechanisms were substituted for convolutional operations. Accordingly, several researchers have applied the model to the diagnosis of ophthalmic disorders, particularly retinal diseases. Jiang et al. [12] employed a ViT to automatically identify normal eyes, age-related macular degeneration and diabetic macular edema, achieving a classification accuracy of 99.69%. Furthermore, a deep learning model based on a ViT was introduced by Wu et al. [13] to assess diabetic retinopathy, and it realized an accuracy of 91.4% and a kappa score of 0.935. However, studies that report on the automatic diagnosis of retinal tears are few.
The present study involved the collection and construction of a retinal tear dataset comprising 1831 images, with the aim of developing more effective diagnostic algorithms. Although ViT is widely acknowledged to be data-driven and to perform exceptionally well with ample training data, our study encountered a hurdle due to the limited availability of data. Although transfer learning has been demonstrated to partially address this challenge, it may not be sufficient and could require additional computational resources. Consequently, a hybrid structure was devised to introduce inductive bias and enhance the model's adaptability to our limited dataset. Furthermore, through experimental analysis, it was observed that the utilization of deformable convolution [14] affords superior adaptability to the contours of lesions and yields improved performance. Based on the aforementioned rationale, we proposed a novel framework called the deformable convolution and transformer network (DCT-Net) in the current study, which integrates the merits of deformable convolution and the vision transformer. The model was subjected to rigorous testing on two datasets to assess its overall performance and efficacy. Additionally, attention maps were generated in order to validate its interpretability. The current body of research on retinal tear diagnostic systems is limited, and our study partially addresses this research gap.
To summarize, the main contributions of the present study can be succinctly stated as follows:
● A dataset comprising 1831 B-scan ultrasound images of retinal tears was assembled.
● A novel model that is more appropriate for small datasets of medical images is proposed. To our knowledge, this study represents the first investigation into the utilization of ViT-based architecture for the purpose of identifying retinal tears through the analysis of ultrasound images.
● The efficacy of the model in terms of lesion detection, as well as its commendable performance, are demonstrated through the analysis of two datasets.
The contents of the current study can be categorized into three primary modules: data collection and preprocessing; model design and validation; and interpretability analysis and external validation. The flowchart is illustrated in Figure 1.
The investigation was carried out in adherence to the Protocol for the Declaration of Helsinki, as amended in 2013.
A comprehensive set of 1902 ultrasound B-scan images was collected for this retrospective study. These samples were obtained from the Eye Hospital of Wenzhou Medical University for the period from October 2017 to April 2022. All positive samples were verified by professional ophthalmologists. However, the images were collected from a variety of devices with varying resolutions and file types. Thus, to accommodate the model's input, each image was resized to 224 × 224 pixels, and blurry images were removed. Finally, 1831 samples (910 positive and 927 negative) were utilized for subsequent investigations.
Data augmentation is a data processing technique employed to enhance the quantity and diversity of training samples by transforming existing data. There are two distinct categories of augmentation, namely, online and offline augmentation. Typically, the former is utilized for larger datasets, wherein operations are executed on each data batch, whereas the latter is employed for smaller datasets, wherein operations are performed directly on the original data [15]. Accordingly, the offline method was selected as a result of the limited dataset available for our study. Various data augmentation techniques, including rotation, cropping, brightness shift, contrast modification, horizontal flipping and vertical flipping, can be employed for image augmentation [16,17]. However, not all enhancement techniques are universally applicable, because the category labels of the images could be altered after enhancement. After conducting this analysis, we opted to employ horizontal flipping, vertical flipping and brightness shifting in order to enhance the original dataset. Figure 2 illustrates the aforementioned augmentation operations.
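The paper does not report the exact augmentation parameters or tooling; the following is a minimal sketch of how the described offline augmentation (horizontal flip, vertical flip, brightness shift) could be carried out with torchvision, assuming a hypothetical folder layout and an assumed brightness range of 0.8–1.2.

```python
import os
from PIL import Image
from torchvision import transforms

# Offline augmentation: each transform is applied deterministically and the
# result is written back to disk alongside the original image.
# The brightness range (0.8-1.2) is an assumed value, not taken from the paper.
AUGMENTATIONS = {
    "hflip": transforms.RandomHorizontalFlip(p=1.0),
    "vflip": transforms.RandomVerticalFlip(p=1.0),
    "bright": transforms.ColorJitter(brightness=(0.8, 1.2)),
}

def augment_folder(src_dir: str, dst_dir: str) -> None:
    """Create flipped / brightness-shifted copies of every image in src_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name)).convert("RGB")
        img = img.resize((224, 224))                 # match the model input size
        img.save(os.path.join(dst_dir, name))        # keep the (resized) original
        for tag, tf in AUGMENTATIONS.items():
            tf(img).save(os.path.join(dst_dir, f"{tag}_{name}"))

# augment_folder("data/train/tear", "data/train_aug/tear")   # hypothetical paths
```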
The ViT model is based on direct global relationship modeling and has demonstrated significant accomplishments in the extraction of global features through the use of a multi-head self-attention mechanism. However, it has limitations in its ability to effectively accommodate minuscule lesions, and it proves inadequate when confronted with limited training data. In this context, convolution operations, specifically deformable convolutions, exhibit better adaptability to local detail characteristics. This study presents a novel approach that integrates the ViT and deformable convolution to realize the accurate detection of retinal tears with enhanced precision. Figure 3 presents a visual representation of the proposed model. Furthermore, transfer learning was employed here to enhance network performance and expedite the training process.
The input images (H × W × C) were split into n patches. After these patches were flattened, a linear projection layer was used to convert them to D-dimensional vectors. A class token was also added, as in BERT [18]. Following position embedding, the D-dimensional vectors were passed to the Transformer Encoder. Maintaining the dimensionality of the vectors throughout the entire process was crucial.
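As a concrete illustration of this patch-embedding stage, the sketch below assumes the standard ViT-Base configuration (16 × 16 patches, D = 768, 224 × 224 inputs), which the paper implies through its use of pretrained ViT weights but does not state explicitly.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an H x W x C image into patches, project them to D dimensions,
    prepend a class token and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # n = 196
        # A strided convolution is equivalent to flatten-then-linear-project per patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, n, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, n + 1, D)
        return x + self.pos_embed             # dimensions preserved downstream

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))   # -> torch.Size([2, 197, 768])
```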
In the Transformer Encoder, the input vectors first undergo layer normalization, which expedites the convergence of the network. The procedure is given by Eq (1), where $\mu$ and $\sigma$ denote the mean and standard deviation of the input, respectively.
$$\mathrm{LayerNorm}(x_i)=\frac{x_i-\mu}{\sqrt{\sigma^{2}+\epsilon}} \quad (1)$$
The resulting output is used to compute mutual attention through the multi-head attention layers (as given in Eqs (2)–(4)). A second Layer Norm and a multi-layer perceptron layer are then applied to obtain the final outputs, and the residual connections around both sub-layers effectively mitigate the issue of gradient vanishing. To make full use of the pretrained weights obtained through transfer learning, we employed the same number of encoders as the conventional ViT model.
$$Q_i=QW_i^{Q},\quad K_i=KW_i^{K},\quad V_i=VW_i^{V} \quad (2)$$

$$\mathrm{head}_i=\mathrm{Attention}(Q_i,K_i,V_i) \quad (3)$$

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_{12}) \quad (4)$$
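The following is a compact sketch of one encoder layer as described by Eqs (1)–(4), assuming the standard ViT-Base settings (D = 768, 12 heads, MLP expansion of 4); PyTorch's built-in multi-head attention performs the per-head projections, attention and concatenation of Eqs (2)–(4) internally.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder layer: LayerNorm -> multi-head attention -> residual,
    then LayerNorm -> MLP -> residual, following Eqs (1)-(4)."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                      # Eq (1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                                   # x: (B, n + 1, D)
        h = self.norm1(x)
        # Q, K and V are all projections of the same tokens; the module splits them
        # into 12 heads and concatenates the results, as in Eqs (2)-(4).
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                                    # residual connection
        x = x + self.mlp(self.norm2(x))                     # residual connection
        return x

print(EncoderBlock()(torch.randn(2, 197, 768)).shape)       # torch.Size([2, 197, 768])
```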
$$y(P_0)=\sum_{P_n\in R} w(P_n)\cdot x(P_0+P_n+\Delta P_n) \quad (5)$$
The diagnosis of retinal tears using ultrasound images is highly dependent on the position and shape of the small lesion areas. However, the standard ViT is insufficient for acquiring such localized data. As a result of conventional convolution employing regular kernels, the receptive field remains constant and is ill-equipped to accommodate variations in edge shape. By appending a learnable offset to the standard convolution kernel, deformable convolution can modify the sampling area's shape, bringing it closer to the object's edge. The sampling procedure for deformable convolution and ordinary convolution is presented in Figure 4. Equation (5) illustrates the calculation process.
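The sampling rule of Eq (5) can be exercised with torchvision's deformable convolution, as sketched below. Predicting the offsets $\Delta P_n$ with an ordinary convolution is the usual practice, but it is an assumption here; the paper does not describe its offset branch.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 64, 64, 3
x = torch.randn(1, in_ch, 56, 56)

# An ordinary convolution predicts the offsets Delta P_n of Eq (5):
# 2 values (dy, dx) for each of the k*k sampling positions.
offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

offsets = offset_pred(x)          # (1, 2*k*k, 56, 56)
y = deform(x, offsets)            # sampled at P_0 + P_n + Delta P_n, weighted by w(P_n)
print(y.shape)                    # torch.Size([1, 64, 56, 56])
```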
Subsequently, a residual deformable convolution block was devised in order to enhance the extraction of intricate features. Similar to the Transformer Encoder, the designed module initially employs a Batch Norm layer to convert the inputs into data with a mean of 0 and a variance of 1. Two deformable convolutional layers were used to capture local concrete detail features. To enhance nonlinearity while minimizing computational workload, the convolutional kernel of the first layer was designed to be larger than that of the second layer. Subsequently, an adaptive average pooling layer was incorporated in order to enhance the efficacy of feature extraction and computational processes. Furthermore, the concept of residual connection was incorporated into the model design, drawing inspiration from ResNet [19]. This addition was made in order to mitigate the issue of gradient vanishing [20].
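The exact layer sizes of this block are not reported; the sketch below shows one plausible arrangement that follows the description (Batch Norm, two deformable convolutions with a larger first kernel, a residual connection and adaptive average pooling). The kernel sizes, channel width and pooling size are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ResidualDeformBlock(nn.Module):
    """BatchNorm -> DeformConv(5x5) -> DeformConv(3x3) -> residual add -> adaptive pooling.
    Kernel sizes, channel width and pooling size are assumed, not taken from the paper."""
    def __init__(self, channels=256, pool_size=7):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)           # normalizes to zero mean, unit variance
        self.off1 = nn.Conv2d(channels, 2 * 5 * 5, kernel_size=5, padding=2)
        self.conv1 = DeformConv2d(channels, channels, kernel_size=5, padding=2)
        self.off2 = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.conv2 = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)

    def forward(self, x):
        h = self.bn(x)
        h = self.act(self.conv1(h, self.off1(h)))    # first (larger-kernel) deformable layer
        h = self.conv2(h, self.off2(h))              # second deformable layer
        h = self.act(h + x)                          # residual connection eases gradient flow
        return self.pool(h)

print(ResidualDeformBlock()(torch.randn(1, 256, 14, 14)).shape)   # torch.Size([1, 256, 7, 7])
```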
The utilization of pooling layers in a CNN can lead to the merging of position information, potentially resulting in the loss of certain details during the generation of rough heat maps [21,22]. Our model, which is founded upon a self-attention mechanism, effectively captures global features and is able to deliver sufficiently detailed visualizations [23]. However, attention-based networks are incompatible with the traditional Grad-CAM [24] method, because a CNN permits the aggregation of feature map weights across multiple channels, whereas the ViT does not allow its distinct patch tokens to be combined in the same way. Therefore, we adopted the attention rollout method proposed by Samira Abnar [25]. Attention rollout essentially computes the product of the attention matrices from the lowest layer to the highest layer of the network. It is realized by recursively propagating the token attention from the input layer toward the higher layers, while taking the residual connections and their weights into account. It is represented by Eq (6).
$$\mathrm{AttentionRollout}_{L}=(A_L+I)\,\mathrm{AttentionRollout}_{L-1} \quad (6)$$
where $A_L$ is the attention matrix of the $L$-th layer and $I$ is the identity matrix.
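A minimal sketch of Eq (6) is given below, following Abnar's rollout procedure: per layer the heads are averaged, the identity is added to account for the residual connection, the rows are re-normalized (a step of the original method not spelled out above), and the matrices are multiplied from the input layer to the output layer. The tensor shapes assume the standard ViT-Base configuration.

```python
import torch

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of L tensors of shape (heads, tokens, tokens),
    i.e., the attention maps collected from each encoder layer.
    Returns the rolled-out attention of shape (tokens, tokens), as in Eq (6)."""
    result = None
    for attn in attn_per_layer:
        a = attn.mean(dim=0)                          # average over heads
        a = a + torch.eye(a.size(-1))                 # add I for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)           # re-normalize the rows
        result = a if result is None else a @ result  # multiply low level -> high level
    return result

# Example with random attention maps: 12 layers, 12 heads, 197 tokens (196 patches + class token).
layers = [torch.rand(12, 197, 197).softmax(dim=-1) for _ in range(12)]
rollout = attention_rollout(layers)
cls_to_patches = rollout[0, 1:].reshape(14, 14)       # class-token attention -> 14 x 14 heat map
```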
A transfer learning strategy was adopted with the aim of expediting the training process and enhancing the performance of the model. The pre-training was conducted on the ImageNet dataset, which comprises a vast collection of natural images from more than 1000 categories. The cross-entropy loss [26,27] was employed as the loss function in our study; this choice avoids the saturation of the sigmoid function's derivative, which results in slow gradient updates. Furthermore, the Adam optimizer [28] was utilized, as it offers rapid convergence and a relatively facile process for configuring hyperparameters.
Furthermore, an early stopping strategy was developed with the intention of mitigating the issue of overfitting. Following each iteration of training, a comprehensive evaluation was conducted on the designated test dataset. The training process was deemed to be complete once the accuracy on the test set ceased to exhibit substantial improvements and stabilized after approximately 10 epochs.
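The training procedure described in the two paragraphs above (cross-entropy loss, Adam, per-epoch evaluation and an accuracy-plateau stopping criterion) could be organized as in the sketch below. The learning rate, patience value and the `model`, `train_loader` and `test_loader` objects are assumptions or placeholders; the paper does not report these details.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, test_loader, device="cuda",
          max_epochs=100, patience=10, lr=1e-4):
    """Cross-entropy loss + Adam, stopping once test accuracy has not improved
    for `patience` epochs. Learning rate and patience are assumed values."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc, best_state, stale = 0.0, copy.deepcopy(model.state_dict()), 0

    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        # evaluate on the held-out set after every epoch
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        acc = correct / total

        if acc > best_acc:
            best_acc, best_state, stale = acc, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:          # accuracy has plateaued -> early stop
                break

    model.load_state_dict(best_state)
    return model, best_acc
```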
In order to enhance the precision of an evaluation of the performance of the designed model, a set of widely recognized state-of-the-art (SOTA) models, viz. Alexnet [29], Inception v3 [30], Resnet101 [19], VGG16 [31] and ViT, were chosen as the baseline models. The preprocessing steps and training strategies remained consistent across all baselines, with the exception of Inception v3, which required an input size of 299 × 299 pixels.
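For reference, the torchvision baselines could be set up for the two-class task roughly as sketched below, with the final classification layer replaced; the exact fine-tuning configuration used by the authors is not reported, so this is only an assumed setup.

```python
import torch.nn as nn
from torchvision import models

def build_baseline(name: str, num_classes: int = 2) -> nn.Module:
    """ImageNet-pretrained baselines with the classifier head replaced for 2 classes."""
    if name == "alexnet":
        m = models.alexnet(pretrained=True)
        m.classifier[6] = nn.Linear(4096, num_classes)
    elif name == "vgg16":
        m = models.vgg16(pretrained=True)
        m.classifier[6] = nn.Linear(4096, num_classes)
    elif name == "resnet101":
        m = models.resnet101(pretrained=True)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "inception_v3":
        m = models.inception_v3(pretrained=True)           # expects 299 x 299 inputs
        m.fc = nn.Linear(m.fc.in_features, num_classes)
        m.AuxLogits.fc = nn.Linear(m.AuxLogits.fc.in_features, num_classes)
    else:
        raise ValueError(f"unknown baseline: {name}")
    return m
```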
Table 1 presents a comprehensive overview of the performance metrics for both the baseline models and the model designed in this study. The confusion matrices of the models on the test set are depicted in Figure 5; the value in each cell is the number of images with the corresponding true and predicted labels, expressed as a percentage of the total number of images with that true label. It is worth mentioning that, within the category of CNN-based models, Inception v3 exhibited the highest level of performance, achieving an accuracy rate of 96.82%, an F1 score of 0.9605 and an AUC of 0.9828. The ViT model with the pure self-attention mechanism did not perform well; in particular, its performance was even worse than that of the CNN-based models. Nevertheless, our designed model exhibited superior performance across all metrics, surpassing all other models, with only 10 samples classified incorrectly. To our knowledge, the proposed model exhibited superior performance even as compared to human experts (with a sensitivity of 96%) [32].
Model | Accuracy | Precision | Recall | F1 Score | AUC |
Alexnet | 95.11% | 94.64% | 95.68% | 0.9456 | 0.9286 |
Inception V3 | 96.82% | 96.55% | 96.37% | 0.9605 | 0.9828 |
Resnet101 | 96.74% | 96.94% | 96.42% | 0.9599 | 0.9772 |
VGG16 | 96.52% | 96.42% | 96.66% | 0.9595 | 0.9598 |
Vit | 95.76% | 95.66% | 95.87% | 0.9515 | 0.9444 |
DCT-Net | 97.78% | 97.34% | 97.13% | 0.9682 | 1.0000 |
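The metrics reported in Table 1 and the row-normalized confusion matrices of Figure 5 can be reproduced from the predicted labels and scores with scikit-learn; a minimal sketch follows, with `y_true`, `y_pred` and `y_score` as hypothetical variable names.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# y_true: ground-truth labels (0 = normal, 1 = retinal tear)
# y_pred: predicted labels; y_score: predicted probability of the positive class
def report(y_true, y_pred, y_score):
    return {
        "Accuracy":  accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall":    recall_score(y_true, y_pred),
        "F1 Score":  f1_score(y_true, y_pred),
        "AUC":       roc_auc_score(y_true, y_score),
        # row-normalized counts, as displayed in Figure 5
        "Confusion": confusion_matrix(y_true, y_pred, normalize="true"),
    }
```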
As an external validation step, we utilized the ORIGA dataset in this section to ensure that the proposed model possesses strong generalizability and can adapt to various types of databases. The dataset comprises a total of 650 fundus images for glaucoma assessment. In order to conduct a comparative analysis against other models documented in the literature [33,34,35], we used the original dataset without employing any augmentation techniques. Table 2 shows the results, where NMD denotes pre-training on a non-medical dataset, SOD denotes pre-training on a similar ophthalmic dataset and CT-Net denotes the proposed network with common convolution in place of the deformable convolution. The ViT did not perform well among them, most likely as a result of the limited dataset. On the other hand, the DCT-Net achieved the highest accuracy at 83.8%, demonstrating the best performance. Additionally, the significance of deformable convolution became apparent when the DCT-Net was compared with CT-Net.
Model | Accuracy | Sensitivity | Specificity |
CNN | 70.4% | 70.7% | 74.8% |
VGG | 70.1% | 69.8% | 71.0% |
GoogLeNet | 71.8% | 69.8% | 73.5% |
ResNet | 71.5% | 71.3% | 71.7% |
Chen [34] | 70.8% | 69.2% | 71.0% |
Shibata [35] | 73.3% | 73.2% | 76.7% |
NMD+CNN | 74.5% | 68.7% | 80.7% |
SOD+CNN | 73.9% | 80.9% | 72.2% |
NMD+Attention | 74.9% | 71.2% | 77.7% |
Xu [33] | 76.6% | 75.3% | 77.2% |
ViT | 71.4% | 74.0% | 67.8% |
CT-Net | 80.5% | 81.7% | 80.1% |
DCT-Net | 83.8% | 82.7% | 82.4% |
Models that are easily interpretable offer valuable insights into their inner workings, thereby benefiting both patients and clinicians. Figure 6 displays three attention maps that were generated from our private dataset, with red circles marking the lesion areas in the original images. In the attention maps, a higher color intensity indicates a greater level of attention. The images demonstrate a strong correspondence between the regions of heightened attention and the lesion areas, which indicates that the model has a well-defined operational basis and good interpretability.
The hardware configuration utilized in this study is as follows. The central processing unit (CPU) utilized in the system comprised a 7-core Intel(R) Xeon(R) CPU E5-2680 v4 operating at a frequency of 2.40 GHz. Additionally, the system incorporated a single graphics processing unit in the form of an RTX 3070ti with 8 GB of dedicated memory. The training process employed Python version 3.8, PyTorch framework version 1.10.0 for machine learning and CUDA version 11.3.
CNNs have demonstrated remarkable performance on previous image processing tasks and are widely acknowledged as the SOTA approach. For instance, Yu et al. [36,37] employed CNNs for the purpose of detecting concrete cracks, achieving exceptional performance. Ragupathy and Karunakaran [38] proposed a CNN-based model for the detection of meningioma brain tumors, and the model demonstrated promising performance metrics. However, due to the constraints imposed by the small convolutional kernel, CNNs may not be able to effectively extract global features. As shown in Table 1, the performance of the CNN-based models appears to have encountered a bottleneck, making further improvements challenging. When comparing the CNN with the ViT, it can be observed that the ViT utilizes the attention mechanism to calculate the relationships between global pixels, thereby enabling a comprehensive global perspective. Numerous studies have substantiated the impressive efficacy of the ViT model [39]. However, our investigation revealed that the pure ViT did not perform well on small datasets of retinal tears (with an accuracy of 95.76%).
To enhance the efficacy of lesion detection on limited datasets, a novel architecture was initially devised, integrating the merits of convolution and attention mechanisms. As shown in Table 2, the utilization of global feature extraction techniques contributes to the generation of a relatively comprehensive latent space feature representation. Concurrently, as a result of incorporating the inductive bias of convolution, the proposed model demonstrates substantial enhancements on the limited public dataset, achieving an accuracy of 80.5%. Moreover, replacing ordinary convolutions with deformable convolutions has been found to yield more favorable outcomes, as evidenced by an accuracy rate of 83.8%. This phenomenon could potentially be attributed to the enhanced precision resulting from extracting both the location and shape of the lesion areas. From the perspective of external validation and interpretable analysis, the model possesses robustness and sufficient accuracy.
Notwithstanding the enhanced performance achieved in this study, certain constraints remain. First, ophthalmic ultrasound is highly dependent on the equipment, technique and examiner experience. However, the data collected for this study came from a variety of devices. This may compromise the validity of the results. Second, all of the retinal tear images utilized in this study were procured from a single hospital. This may lead to an absence of diversity in the cases. Moreover, only retinal tears were included in our study. Ultrasound imaging can, in fact, be utilized to diagnose additional retinal disorders. Correspondingly, the value of the model can be enhanced through the incorporation of additional disease types. Finally, the incorporation of the residual deformable convolution module and the utilization of a ViT as the feature extractor resulted in an increased number of parameters for our model (Table 3). This results in increased demands on the environment in terms of model deployment.
Utilizing ultrasound to identify retinal tears is an extremely practical method. It is superior to alternative approaches when it comes to handling intricate clinical scenarios, such as ocular media opacity. However, the extraction of useful features via conventional machine learning methods is hampered by low resolution. Fortunately, the progress that has been made in deep learning enables the analysis of these images in an efficient manner. Our current research is, without a doubt, preliminary in nature. Moving forward, we aim to enhance the model's architecture and implement global vision technology that is more streamlined or possesses a reduced number of parameters. This will allow the effortless deployment of lightweight models across diverse environments. Furthermore, our objective is to enhance the quantity and range of samples gathered in order to prevent issues with model generalization that may arise from discrepancies in the training data. Lastly, we will collaborate with clinicians and conduct additional multicenter studies to precisely quantify the extent to which this model can benefit physicians.
Model | Parameters (× 10^6)
Alexnet | 57.01 |
Inception v3 | 25.12 |
Resnet101 | 42.5 |
VGG16 | 134.27 |
Vision Transformer | 85.80 |
DCT-Net | 138.36 |
A novel model was developed for the diagnosis of ophthalmological conditions in the current study. The model demonstrated superior performance on both our proprietary dataset and the glaucoma dataset that was publicly available. The framework is a comprehensive computing framework that exhibits superior performance and does not necessitate the generation of manually designed features. Overall, this technology provides significant practical value in the field of clinical application, particularly in the realm of automated diagnosis.
The authors declare that they have not used artificial intelligence tools in the creation of this article.
We would like to thank all editors and reviewers for their careful review and revision of the paper. This research was supported in part by the National Key R & D Program of China [2018YFA0701700].
The authors declare that there is no conflict of interest.
[1] Y. Chen, Y. Zhang, B. Xiao, H. Li, A framework for the elderly first aid system by integrating vision-based fall detection and BIM-based indoor rescue routing, Adv. Eng. Inf., 54 (2022), 101766. https://doi.org/10.1016/j.aei.2022.101766
[2] M. Mubashir, L. Shao, L. Seed, A survey on fall detection: Principles and approaches, Neurocomputing, 100 (2013), 144–152. https://doi.org/10.1016/j.neucom.2011.09.037
[3] S. Nooruddin, M. Islam, F. A. Sharna, H. Alhetari, M. N. Kabir, Sensor-based fall detection systems: a review, J. Ambient Intell. Hum. Comput., 2009 (2009), 1–17. https://doi.org/10.1109/biocas.2009.5372032
[4] F. A. S. F. de Sousa, C. Escriba, E. G. A. Bravo, V. Brossa, J. Y. Fourniols, C. Rossi, Wearable pre-impact fall detection system based on 3D accelerometer and subject's height, IEEE Sens. J., 22 (2022), 1738–1745. https://doi.org/10.1109/biocas.2009.5372032
[5] Z. Lin, Z. Wang, H. Dai, X. Xia, Efficient fall detection in four directions based on smart insoles and RDAE-LSTM model, Expert Syst. Appl., 205 (2022), 117661. https://doi.org/10.1016/j.eswa.2022.117661
[6] P. Bet, P. C. Castro, M. A. Ponti, Fall detection and fall risk assessment in older person using wearable sensors: A systematic review, Int. J. Med. Inf., 130 (2019), 103946. https://doi.org/10.1016/j.ijmedinf.2019.08.006
[7] I. Boudouane, A. Makhlouf, M. A. Harkat, M. Z. Hammouche, N. Saadia, A. R. Cherif, Fall detection system with portable camera, J. Ambient Intell. Hum. Comput., 11 (2019), 2647–2659. https://doi.org/10.1007/s12652-019-01326-x
[8] E. Casilari, C. A. Silva, An analytical comparison of datasets of Real-World and simulated falls intended for the evaluation of wearable fall alerting systems, Measurement, 202 (2022), 111843. https://doi.org/10.1016/j.measurement.2022.111843
[9] C. Wang, L. Tang, M. Zhou, Y. Ding, X. Zhuang, J. Wu, Indoor human fall detection algorithm based on wireless sensing, Tsinghua Sci. Technol., 27 (2022), 1002–1015. https://doi.org/10.26599/tst.2022.9010011
[10] S. Madansingh, T. A. Thrasher, C. S. Layne, B. C. Lee, Smartphone based fall detection system, in 2015 15th International Conference on Control, Automation and Systems (ICCAS), (2015), 370–374. https://doi.org/10.1109/ICCAS.2015.7364941
[11] B. Wang, Z. Zheng, Y. X. Guo, Millimeter-Wave frequency modulated continuous wave radar-based soft fall detection using pattern contour-confined Doppler-Time maps, IEEE Sens. J., 22 (2022), 9824–9831. https://doi.org/10.1109/jsen.2022.3165188
[12] K. Chaccour, R. Darazi, A. H. El Hassani, E. Andres, From fall detection to fall prevention: A generic classification of fall-related systems, IEEE Sens. J., 17 (2017), 812–822. https://doi.org/10.1109/jsen.2016.2628099
[13] J. Gutiérrez, V. Rodríguez, S. Martin, Comprehensive review of vision-based fall detection systems, Sensors, 21 (2021), 947. https://pubmed.ncbi.nlm.nih.gov/33535373
[14] C. Y. Hsieh, K. C. Liu, C. N. Huang, W. C. Chu, C. T. Chan, Novel hierarchical fall detection algorithm using a multiphase fall model, Sensors, 17 (2017), 307. https://pubmed.ncbi.nlm.nih.gov/28208694
[15] L. Ren, Y. Peng, Research of fall detection and fall prevention technologies: A systematic review, IEEE Access, 7 (2019), 77702–77722. https://doi.org/10.1109/access.2019.2922708
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2005), 886–893. https://doi.org/10.1109/cvpr.2005.177
[17] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, ORB: An efficient alternative to SIFT or SURF, in 2011 International Conference on Computer Vision, (2011), 2564–2571. https://doi.org/10.1109/iccv.2011.6126544
[18] X. Wang, T. X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in 2009 IEEE 12th International Conference on Computer Vision, (2009), 32–39. https://doi.org/10.1109/iccv.2009.5459207
[19] M. Islam, S. Nooruddin, F. Karray, G. Muhammad, Human activity recognition using tools of convolutional neural networks: A state of the art review, data sets, challenges and future prospects, Comput. Biol. Med., 149 (2022), 106060. https://doi.org/10.1109/iccv.2009.5459207
[20] K. C. Liu, K. H. Hung, C. Y. Hsieh, H. Y. Huang, C. T. Chan, Y. Tsao, Deep-learning-based signal enhancement of low-resolution accelerometer for fall detection systems, IEEE Trans. Cognit. Dev. Syst., 14 (2022), 1270–1281. https://doi.org/10.1109/tcds.2021.3116228
[21] X. Yu, B. Koo, J. Jang, Y. Kim, S. Xiong, A comprehensive comparison of accuracy and practicality of different types of algorithms for pre-impact fall detection using both young and old adults, Measurement, 201 (2022), 111785. https://doi.org/10.2139/ssrn.4132951
[22] X. Lu, W. Wang, J. Shen, D. J. Crandall, L. Van Gool, Segmenting objects from relational visual data, IEEE Trans. Pattern Anal. Mach. Intell., 44 (2022), 7885–7897. https://doi.org/10.1109/tpami.2021.3115815
[23] H. M. Abdulwahab, S. Ajitha, M. A. N. Saif, Feature selection techniques in the context of big data: taxonomy and analysis, Appl. Intell., 52 (2022), 13568–13613. https://doi.org/10.1007/s10489-021-03118-3
[24] D. Mrozek, A. Koczur, B. Małysiak-Mrozek, Fall detection in older adults with mobile IoT devices and machine learning in the cloud and on the edge, Inf. Sci., 537 (2020), 132–147. https://doi.org/10.1016/j.ins.2020.05.070
[25] X. Cai, S. Li, X. Liu, G. Han, Vision-based fall detection with multi-task hourglass convolutional auto-encoder, IEEE Access, 8 (2020), 44493–44502. https://doi.org/10.1109/access.2020.2978249
[26] C. Vishnu, R. Datla, D. Roy, S. Babu, C. K. Mohan, Human fall detection in surveillance videos using fall motion vector modeling, IEEE Sens. J., 21 (2021), 17162–17170. https://doi.org/10.1109/jsen.2021.3082180
[27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, et al., Swin transformer: Hierarchical vision transformer using shifted windows, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 10012–10022. https://doi.org/10.48550/arXiv.2103.14030
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in Advances in Neural Information Processing Systems, 30 (2017).
[29] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, 86 (1998), 2278–2324. https://doi.org/10.1109/5.726791
[30] H. Pashler, J. C. Johnston, E. Ruthruff, Attention and performance, Ann. Rev. Psychol., 52 (2001), 629. https://doi.org/10.1146/annurev.psych.52.1.629
[31] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, et al., Deep high-resolution representation learning for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell., 43 (2021), 3349–3364. https://doi.org/10.1109/TPAMI.2020.2983686
[32] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, et al., Microsoft COCO: Common objects in context, in European Conference on Computer Vision, (2014), 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
[33] M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, 2D human pose estimation: New benchmark and state of the art analysis, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2014), 3686–3693. https://doi.org/10.1109/cvpr.2014.471
[34] B. Sapp, B. Taskar, Modec: Multimodal decomposable models for human pose estimation, in 2013 IEEE Conference on Computer Vision and Pattern Recognition, (2013), 3674–3681. https://doi.org/10.1109/cvpr.2013.471
[35] B. Kwolek, M. Kepski, Human fall detection on embedded platform using depth maps and wireless accelerometer, Comput. Methods Programs Biomed., 117 (2014), 489–501. https://doi.org/10.1016/j.cmpb.2014.09.005
[36] K. Adhikari, H. Bouchachia, H. Nait-Charif, Activity recognition for indoor fall detection using convolutional neural network, in 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), (2017), 81–84. https://doi.org/10.23919/mva.2017.7986795
[37] L. Martínez-Villaseñor, H. Ponce, J. Brieva, E. Moya-Albor, J. Núñez-Martínez, C. Peñafort-Asturiano, UP-fall detection dataset: A multimodal approach, Sensors, 19 (2019), 988. https://pubmed.ncbi.nlm.nih.gov/31035377
[38] H. Yhdego, J. Li, S. Morrison, M. Audette, C. Paolini, M. Sarkar, et al., Towards musculoskeletal simulation-aware fall injury mitigation: transfer learning with deep CNN for fall detection, in 2019 Spring Simulation Conference (SpringSim), (2019), 1–12. https://doi.org/10.22360/springsim.2019.msm.015
[39] H. Sadreazami, M. Bolic, S. Rajan, Fall detection using standoff radar-based sensing and deep convolutional neural network, IEEE Trans. Circuits Syst. II Express Briefs, 67 (2020), 197–201. https://doi.org/10.1109/tcsii.2019.2904498
[40] A. Núñez-Marcos, G. Azkune, I. Arganda-Carreras, Vision-based fall detection with convolutional neural networks, Wireless Commun. Mobile Comput., 2017 (2017). https://doi.org/10.1155/2017/9474806
[41] S. Chhetri, A. Alsadoon, T. Al-Dala'in, P. W. C. Prasad, T. A. Rashid, A. Maag, Deep learning for vision-based fall detection system: Enhanced optical dynamic flow, Comput. Intell., 37 (2020), 578–595. https://doi.org/10.1111/coin.12428
[42] C. Khraief, F. Benzarti, H. Amiri, Elderly fall detection based on multi-stream deep convolutional networks, Multimedia Tools Appl., 79 (2020), 19537–19560. https://doi.org/10.1007/s11042-020-08812-x
[43] N. Lu, Y. Wu, L. Feng, J. Song, Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data, IEEE J. Biomed. Health Inf., 23 (2019), 314–323. https://doi.org/10.1109/jbhi.2018.2808281
[44] H. Li, C. Li, Y. Ding, Fall detection based on fused saliency maps, Multimedia Tools Appl., 80 (2020), 1883–1900. https://doi.org/10.1007/s11042-020-09708-6
[45] R. K. Meleppat, M. V. Matham, L. K. Seah, Optical frequency domain imaging with a rapidly swept laser in the 1300 nm bio-imaging window, in International Conference on Optical and Photonic Engineering, (2015), 721–729. https://doi.org/10.1117/12.2190530
[46] K. M. Ratheesh, L. K. Seah, V. M. Murukeshan, Spectral phase-based automatic calibration scheme for swept source-based optical coherence tomography systems, Phys. Med. Biol., 61 (2016), 7652–7663. https://doi.org/10.1088/0031-9155/61/21/7652
[47] R. K. Meleppat, M. V. Matham, L. K. Seah, An efficient phase analysis-based wavenumber linearization scheme for swept source optical coherence tomography systems, Laser Phys. Lett., 12 (2015), 055601. https://doi.org/10.1088/1612-2011/12/5/055601
[48] R. K. Meleppat, C. R. Fortenbach, Y. Jian, E. S. Martinez, K. Wagner, B. S. Modjtahedi, et al., In vivo imaging of retinal and choroidal morphology and vascular plexuses of vertebrates using swept-source optical coherence tomography, Transl. Vision Sci. Technol., 11 (2022), 11. https://pubmed.ncbi.nlm.nih.gov/35972433
[49] V. M. Murukeshan, L. K. Seah, C. Shearwood, Quantification of biofilm thickness using a swept source based optical coherence tomography system, in International Conference on Optical and Photonic Engineering, (2015), 683–688. https://doi.org/10.1117/12.2190106