
    The continuous development of deep learning technology has had a significant impact on research in many fields. For instance, in biomedicine, deep learning based automatic diagnostic techniques have emerged, enabling image recognition and assisting healthcare professionals in diagnosis and subsequent procedures [1,2,3,4,5]. Deep learning has also demonstrated superior performance in scenarios with larger datasets, such as multi-view clustering [6,7,8,9,10]. Knowledge graphs (KGs) have emerged as a way to store, and learn from, such vast amounts of information.

    A knowledge graph can be conceptualized as a large-scale semantic network that represents entities as nodes and relationships as directed edges; it thus stores a vast amount of human knowledge in the form of a directed graph. The resource description framework (RDF) provides a standard framework for KG representation, wherein fact triples (head, relationship, tail) are employed to describe knowledge [11]. A KG can store rich information about real-world entities and their relationships and can support a range of reasoning processes across the graph. Compared to traditional structured data, this graph-based approach to data processing has demonstrated superior performance in tasks such as information retrieval, question-answering systems, and recommendation systems [12,13]. However, because real-world knowledge is unbounded and constantly evolving, KGs are inherently incomplete, which motivates the task of knowledge graph completion (KGC).

    In the field of natural language processing (NLP), KGC techniques can be broadly categorized into three types: rule-based models, path-based models, and embedding-based models. Rule-based models tend to retain the original semantic information more completely and therefore offer better interpretability. Path-based models make better use of the graph structure, enabling guided reasoning through various path-searching mechanisms. Both approaches are more interpretable, though their expressiveness is limited by model constraints and their time and space complexity is higher. Compared to these two types, embedding-based models typically offer greater expressiveness. With the development of graph neural networks (GNNs), GNN-based models have shown great potential in various graph-based tasks, providing additional ideas for KGC. In recent years, KGs have also been studied in computer vision, for example in scene graphs and in the integration of language and images.

    In recent years, multi-modal knowledge graphs (MKG) have gained significant attention as an extension to traditional knowledge graphs based on a single modality. MKGs typically augment semantic KGs with additional modality data, such as visual and audio attributes, to provide more physically rich representations of the world [14,15,16], as illustrated in Figure 1. For a given entity in the knowledge graph, we can use both image and text descriptions to supplement more detailed information that cannot be captured solely by the graph structure. Unfortunately, due to the lack of accumulated multi-modal corpora, existing MKGs often suffer from more severe incompleteness compared to traditional KGs, which greatly reduces their utility and effectiveness. In the task of multi-modal knowledge graph completion (MKGC), we must consider both the issues of multi-modal information fusion and the accuracy and interpretability of knowledge graph completion. In terms of multi-modal information fusion, we need to address issues such as semantic alignment, noise reduction or attenuation, and the realization of unified embeddings. In the process of link prediction, we must not only leverage the semantic richness of multi-modal information to improve accuracy but also enhance the logicality of the algorithm and improve its interpretability [17,18].

    Figure 1.  A simple multi-modal knowledge graph example.

    Despite the abundance of existing image-text embedding pre-training models, these models often focus on single pairs of corresponding images and text and fail to consider the distinctive structural features of KGs. Therefore, our research builds upon MKGs that contain image-text feature information. In addition to integrating embeddings from different modalities, we also retain local graph features and introduce path features to enhance the interpretability of the reasoning model. Specifically, we propose a method that first utilizes separate modality encoders to learn image and text embeddings, followed by an irrelevance filtering layer that selects semantically relevant key features. Next, we fuse and encode information from the different modalities to obtain a multi-modal representation. We then use graph convolution and path features to extract structural features, and use a scoring function to predict missing triples. Our contributions can be summarized as follows:

    1) We designed a structure that extracts image-text information through single-modality encoding followed by interactive fusion, and improved semantic similarity through an irrelevance filtering module, thereby enhancing the fused understanding of the different modalities;

    2) We proposed a structural feature learning scheme that combines graph convolution and path embedding, thereby enhancing interpretability during the reasoning process;

    3) We achieved better results on two public datasets, FB15K-237-IMG and WN18-IMG.

    The task of knowledge graph completion has been widely studied, with typical sub-tasks including link prediction, entity prediction, and relation prediction, all aimed at predicting missing triples (head, relation, tail) in the knowledge graph. Rule-based models such as AMIE and RLvLR utilize symbolic features to perform reasoning through rule mining or rule searching algorithms [19,20]. NeuralLP introduced dynamic programming and further optimized rule mining through attention mechanisms and auxiliary memory [21]. Path-based models focus more on the paths between queried head and tail entities; algorithms such as the path ranking algorithm (PRA) and random walks have been applied and further explored in such models. RNNPRA uses recurrent neural networks (RNNs) to better learn path features for reasoning tasks [22]. DIVA proposed a unified reasoning framework that divides multi-hop reasoning into path search and path inference steps [23]. The continuous development of deep reinforcement learning (DRL) techniques has enabled more effective multi-hop reasoning in sparse graphs; models such as DeepPath and MultiHop achieve more effective path exploration by designing new reward mechanisms [24,25].

    Currently, the mainstream methods for solving KGC problems are embedding-based models. Translation-based models such as TransE, TransR, and TransH embed entities and relations by projection and use a distance function to score factual triplets [26,27,28]. Tensor factorization models such as RESCAL, Tucker, and LowFER use vectors to capture latent semantics through tensor decomposition, continuously improving model efficiency while reducing model size [29,30,31]. With the continuous improvement of neural networks (NNs) in learning and expressing knowledge, further embedding-based models use neural network architectures to implement KGC. NTN uses neural tensor networks for relation reasoning in KGs [32]. ConvE learns deeper features using two-dimensional convolutional layers [33]. InteractE processes more complex semantic information and KG interactions through operations such as feature reshaping, feature permutation, and circular convolution [34]. Although CNN-based KGC models generally perform better than traditional NN models, the feature information contained in the graph structure itself has not been well utilized. Therefore, GNNs have been introduced into the KGC field to perform more complex reasoning tasks based on graph structure features. RGCN encodes each entity into a vector, uses relation-specific transformations to aggregate neighborhood information, and then reproduces facts through a decoder [35]. SACN uses weighted graph convolutional networks (WGCN) as the encoder and feeds the encoded information into a convolutional network for decoding [36]. NBF-Net and RED-GNN improve on traditional algorithms, adopting the Bellman-Ford algorithm and dynamic programming, respectively, to optimize the propagation strategy of previous GNN models and achieve efficiency gains [37,38].

    The traditional tasks in the two major fields of computer vision (CV) and natural language processing (NLP) have been extensively studied, and more recent research has focused on cross-modal problems. The optimization and development of the Transformer model has led to a series of explorations into visual-text pre-training frameworks. VisualBERT, considered the first image-text pre-training model, uses Faster R-CNN to extract visual features, concatenates them with text embeddings, and feeds them into a Transformer initialized from BERT [39]. Inspired by the feature extraction and architecture of VisualBERT, more pre-training models have been proposed by adjusting the pre-training tasks and datasets. CLIP uses a dataset of 400 million image-text pairs for pre-training, learning representations by directly matching raw text with corresponding images [40]. METER further explores single-modal feature extraction and handles multi-modal fusion using a dual-stream architecture, achieving excellent performance on many downstream tasks [41].

    Numerous multi-modal pre-training models have adopted masked language modeling (MLM), masked visual modeling (MVM), and visual-linguistic matching (VLM) as pre-training objectives; their corresponding downstream tasks mainly deal with the meaning of, and relationships between, text and images, such as visual question answering (VQA), visual commonsense reasoning (VCR), and visual captioning (VC). For KGs, however, the feature distinguishing them from purely semantic structured information is their graph structure. Recently, some studies have recognized the importance of structural features for handling KG-related tasks. DRAGON proposes a deep bidirectional, self-supervised method for pre-training language-knowledge models from text and KGs [42]. Knowledge-CLIP takes entities and relations in KGs as inputs and extracts their original features [43]. Entities can be in the form of images or text, while relations are described using language tokens. These pre-training models with structural features provide better options for MKG-related tasks.

    As an emerging research field, work on MKGC is not yet systematic; early MKGC approaches often directly added image information to the input of an original KGC model, which usually led to suboptimal performance. To address this issue, many studies have made further attempts and explorations in image-text feature fusion for MKGs.

    IKRL first proposed an attention-based neural network to consider the visual information in entity images [44]. TransAE introduced a KG representation learning method that integrates multi-channel (visual and language) information in a translation-based framework, and extended the definition of triple energy to accommodate the new multi-channel representations [45]. MKBE and MRCGN integrate different neural encoders and decoders with relational models to learn embeddings of multi-modal data for inference [14,46]. MarT constructed a multi-channel analogical reasoning framework based on structure-mapping theory to improve model interpretability [47]. MMKGR uses a unified gate-attention network to perform attention interaction and filter noise, generating more effective and reliable multi-modal complementary feature encodings, and designs a new reinforcement learning framework to predict missing elements in multi-hop reasoning [16]. MM-RNS proposed a multi-channel relation-enhanced negative sampling framework that provides bidirectional attention between visual and textual features by integrating relation embeddings, combined with contrastive learning to construct an effective contrastive semantic sampler and improve MKGC performance [48].

    We have conducted a brief overview of the related models in traditional and multimodal KGs, as shown in Table 1.

    Table 1.  Summarization of existing KGC models.
    Category | Knowledge graph completion | Multi-modal knowledge graph completion
    Rule-based models | AMIE, RLvLR, NeuralLP | -
    Path-based models | RNNPRA, DIVA, DeepPath, MultiHop | -
    Embedding-based models (translational) | TransE, TransR, TransH | TransAE
    Embedding-based models (tensor decompositional) | RESCAL, Tucker, LowFER | -
    Embedding-based models (neural network) | NTN, ConvE, InteractE, RGCN, SACN, NBF-Net, RED-GNN | IKRL, MKBE, MRCGN, MarT, MMKGR, MM-RNS


    To demonstrate the effectiveness of the aforementioned work more clearly, Table 2 presents a more detailed comparison of selected algorithms.

    Table 2.  Model performance comparison.
    Model | Dataset | Technique | Performance (Hits@10, %)
    RLvLR | FB75K | Logic rules | 43.4
    MultiHop | FB15k-237 | Relation paths | 56.4
    TransE | FB15k-237 | Translational | 47.1
    LowFER | FB15k-237 | Tensor decompositional | 54.4
    RED-GNN | FB15k-237 | GNN | 55.8
    TransAE | WN9-IMG | Translational | 94.84
    MMKGR | WN9-IMG | Attention | 92.8


    The knowledge graph $G=\{E,R,F\}$ is a directed graph, where $E$ is the entity set, $R$ is the relation set, and $F=\{(h,r,t)\mid h\in E,\, t\in E,\, r\in R\}$ is the fact set consisting of fact triples $(h,r,t)$. The head entity $h\in E$ and the tail entity $t\in E$ are connected by a relation $r\in R$. For a multi-modal knowledge graph $G$, each entity $e$ includes two modalities, namely textual information $e^t$ and visual information $e^v$.

    The purpose of multi-modal KGC is to infer incomplete triplets $T=\{(h,r,t)\mid h\in E,\, t\in E,\, r\in R,\, (h,r,t)\notin F\}$ based on known fact triplets $(h,r,t)$. In practice, the incomplete triplets in our prediction task can take three forms, namely $(h,r,?)$, $(h,?,t)$, and $(?,r,t)$. In the implementation, we input the feature information of entities $e$ and relations $r$ into an encoder to obtain the corresponding embedding vectors $h$, $r$, $t$. Then, we use a scoring function $f(h,r,t)$ to evaluate the probability that an inferred triplet is true: when $(h,r,t)\in G$ holds, $f(h,r,t)$ scores 1; otherwise, when $(h,r,t)\notin G$, $f(h,r,t)$ scores 0. Taking a missing triplet of the form $(h,?,t)$ as an example, assume the existence of a relation $r_{pd}$ between the head entity $h$ and the tail entity $t$, giving the complete triplet $(h,r_{pd},t)$ of unknown truthfulness. To evaluate the probability of its actual occurrence, we apply the scoring function, obtaining $f(h,r_{pd},t)$. The basic terminology definitions are shown in Table 3, and a minimal ranking sketch follows the table.

    Table 3.  Notation summary.
    Notation | Explanation
    $G$ | Multi-modal knowledge graph
    $E$ | Entity set
    $R$ | Relation set
    $F$ | Fact set
    $T$ | Incomplete fact set
    $(h,r,t)$ | Fact triplet of head, relation, tail
    $h$ | Embedding of the head entity
    $t$ | Embedding of the tail entity
    $r$ | Embedding of the relation

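    To make the ranking procedure concrete, the following minimal sketch (ours, for illustration; `score_fn` and the embedding tables are hypothetical stand-ins) shows how a tail query $(h,r,?)$ is answered by scoring every candidate entity and ranking the ground truth:

```python
import torch

def rank_tail(score_fn, h_emb, r_emb, entity_embs, true_tail_idx):
    """Answer a query (h, r, ?) by scoring every candidate tail entity.

    `score_fn(h, r, t)` is any scoring function that returns a higher
    value for more plausible triples; `entity_embs` is an (|E|, d) table.
    Returns the rank of the ground-truth tail (1 = best)."""
    scores = torch.stack([score_fn(h_emb, r_emb, t) for t in entity_embs])
    # rank = 1 + number of candidates scored strictly higher than the truth
    return 1 + (scores > scores[true_tail_idx]).sum().item()
```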

    The model we proposed, MLSFF, has an overall architecture shown in Figure 2, which consists of three components: 1) single-modality encoders for image and text embedding; 2) a multi-modal feature fusion mechanism with irrelevant filtering to discard interfering information and to reduce noise when the image and text features interact with each other; 3) a reasoning framework that combines the graph structure and path features, introduces a new scoring function containing multi-hop path features, and uses multi-modal features to predict incomplete triplets in KGC processes.

    Figure 2.  Overview of our model structure.

    The emergence of the Transformer model caused a revolution in the NLP field, and Transformers are now widely used across tasks. Introducing the Transformer into the CV field has likewise achieved remarkable results: when the pre-training data is large enough, Transformers significantly outperform CNNs in CV, overcoming the limitation of having few inductive biases and transferring better to downstream tasks. We use independent image and text encoders based on the Transformer architecture to extract features from the raw inputs. For a given triple, the entity and relation are sent to the corresponding encoder based on their modality (image or text); a relation, represented by language tokens, is sent to the text encoder in the same way as a text entity. The main architecture of our single-modality encoder is illustrated in Figure 3.

    Figure 3.  Structure of single modal encoder.

    Visual Encoder For image feature extraction, we adopt the embedding layer and Transformer encoder of the pre-trained ViT model as the main architecture [49]. Let $C$ be the number of channels in the image (for RGB images, $C=3$) and let the resolution of each image patch be $(P,P)$. First, we scale the input image $I$ to a unified resolution $(A,B)$ and divide it into $N = AB/P^2$ patches. We use a linear mapping (i.e., an FC layer) to transform each patch into a one-dimensional vector, yielding the patch embedding $X^v_{pat}$ of the original image. Subsequently, we feed the image embedding together with the position embedding $X^v_{pos}$ into the Transformer encoder. The overall forward computation is as follows:

    $X^v_0 = X^v_{pat} + X^v_{pos}$ (4.1)
    $\hat{X}^v_l = \mathrm{MSA}(\mathrm{LN}(X^v_{l-1})) + X^v_{l-1}, \quad l = 1,2,\ldots,L_v$ (4.2)
    $X^v_l = \mathrm{FFN}(\mathrm{LN}(\hat{X}^v_l)) + \hat{X}^v_l, \quad l = 1,2,\ldots,L_v$ (4.3)

    The MSA block consists of a multi-head self-attention mechanism, layer normalization, and a skip connection (LayerNorm & Add), repeated $L_v$ times; the output of the $l$th block is $\hat{X}^v_l$. The MLP block consists of a feed-forward network, layer normalization, and a skip connection (LayerNorm & Add), also repeated $L_v$ times; the output of the $l$th block is $X^v_l$.
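    As a minimal sketch of the patch-embedding step in Eq (4.1) (ours; the default resolution, patch size, and dimension follow common ViT-B/16 conventions and are assumptions here):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, linearly project each patch with
    an FC layer, and add a learnable position embedding (Eq 4.1)."""
    def __init__(self, A=224, B=224, P=16, C=3, d=768):
        super().__init__()
        self.P = P
        self.N = (A * B) // (P * P)                          # N = AB / P^2 patches
        self.proj = nn.Linear(C * P * P, d)                  # per-patch linear mapping
        self.pos = nn.Parameter(torch.zeros(1, self.N, d))   # position embedding X_pos

    def forward(self, img):                                  # img: (batch, C, A, B)
        # unfold into flattened P x P patches: (batch, N, C*P*P)
        patches = nn.functional.unfold(img, self.P, stride=self.P).transpose(1, 2)
        return self.proj(patches) + self.pos                 # X_0 = X_pat + X_pos
```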

    Textual Encoder In NLP tasks, a large number of pre-training models based on the Transformer architecture have emerged, such as BERT, which has been widely applied and has demonstrated great success in various downstream tasks [50,51]. In this paper, we use BERT to perform language modeling and feature extraction. Specifically, we split the complete sentence into a word sequence and perform word embedding to obtain the word embeddings $X^t_{word}$. To preserve sentence-level features, we also embed the entire sentence and align it with the word embeddings to obtain the sentence embeddings $X^t_{sen}$. Then, we send the word embeddings $X^t_{word}$, position embeddings $X^t_{pos}$, and sentence embeddings $X^t_{sen}$ to the encoder.

    $X^t_0 = X^t_{word} + X^t_{sen} + X^t_{pos}$ (4.4)
    $\hat{X}^t_l = \mathrm{LN}(\mathrm{MSA}(X^t_{l-1})) + X^t_{l-1}, \quad l = 1,2,\ldots,L_t$ (4.5)
    $X^t_l = \mathrm{LN}(\mathrm{FFN}(\hat{X}^t_l)) + \hat{X}^t_l, \quad l = 1,2,\ldots,L_t$ (4.6)

    The difference between the textual and visual encoders is that layer normalization (LN) is applied after the multi-head self-attention (MSA) and feed-forward network (FFN) layers. As before, the output of the $l$th MSA block is denoted $\hat{X}^t_l$ and the output of the $l$th MLP block is denoted $X^t_l$; the number of MSA and MLP blocks in the text encoder is $L_t$.
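    The contrast amounts to where LN sits in each residual branch. A minimal sketch (ours; the dimension and head count are placeholder assumptions) of the pre-norm visual block (Eqs 4.2 and 4.3) versus the post-norm textual block (Eqs 4.5 and 4.6):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Visual encoder block: LN before MSA/FFN (Eqs 4.2-4.3)."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.msa(h, h, h)[0]        # X̂_l = MSA(LN(X_{l-1})) + X_{l-1}
        return x + self.ffn(self.ln2(x))    # X_l = FFN(LN(X̂_l)) + X̂_l

class PostNormBlock(nn.Module):
    """Textual encoder block: LN after MSA/FFN (Eqs 4.5-4.6)."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        x = self.ln1(self.msa(x, x, x)[0]) + x   # X̂_l = LN(MSA(X_{l-1})) + X_{l-1}
        return self.ln2(self.ffn(x)) + x         # X_l = LN(FFN(X̂_l)) + X̂_l
```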

    In the multimodal fusion module, we fuse the separately encoded text and image information. Relations form a separate data category carrying label information: although they are usually described in text, their semantic relevance to the textual and visual descriptions of entities is relatively low. Therefore, we fuse and filter the image and text information separately from the relations, and introduce the encoded relation attributes later, when learning the path features.

    To enhance the efficiency of the semantic interaction between the two different modalities of image and text, we adopt an intermediate representation to unify the multimodal information. On one hand, we aim to achieve a more fine-grained interaction between different modal feature information; on the other hand, since images often contain semantically irrelevant information, directly using the complete image embedding in the feature fusion process may introduce noise. Therefore, we feed the learned image and text vectors into a multimodal gated unit for weight learning to achieve the intermediate feature representation.

    $g_f = \sigma(X^v W^v \odot X^t W^t)$ (4.7)
    $\hat{X}^m = g_f X^v + (1-g_f) X^t$ (4.8)

    In these equations, $\sigma$ is the sigmoid function, $X^v$ and $X^t$ are the feature vectors output by the image and text encoders, respectively, $W^v$ and $W^t$ are parameter matrices, $g_f$ is a scalar in the range $[0,1]$, $\hat{X}^m$ is the multi-modal embedding vector obtained through the filtering layer, and $\odot$ denotes element-wise multiplication (the Hadamard product).
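    A minimal sketch of this gated filtering unit (ours; projecting each modality to a single logit is one plausible way to realize the scalar gate $g_f$ described above):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Irrelevance-filtering gate (Eqs 4.7 and 4.8): learn a scalar weight
    g_f in [0, 1] per sample and mix image and text features accordingly."""
    def __init__(self, d=768):
        super().__init__()
        self.Wv = nn.Linear(d, 1, bias=False)   # parameter matrix W^v
        self.Wt = nn.Linear(d, 1, bias=False)   # parameter matrix W^t

    def forward(self, xv, xt):                  # xv, xt: (batch, d)
        # g_f = sigma(X^v W^v ⊙ X^t W^t); here each projection yields one logit
        g = torch.sigmoid(self.Wv(xv) * self.Wt(xt))
        return g * xv + (1 - g) * xt            # X̂^m = g_f X^v + (1 - g_f) X^t
```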

    We then feed the fused embedding $\hat{X}^m$ into the multi-modal encoder to further learn its semantic features.

    $X = \mathrm{Tran}(\hat{X}^m)$ (4.9)

    Through the preceding structure we obtain the multi-modal feature embedding of a given fact description, but this is insufficient for large-scale and complex KGs. Hence, we further learn path features to better accomplish the KGC task. The overall approach to learning structural features and completing the graph can be summarized as follows. First, we extract a path from the MKG, connect the relations along the path, and divide it into several shorter components with a sliding window. Then, we select a component and use a recurrent attention unit to embed it, obtaining a relation vector represented as a weighted combination of existing relations. We recursively merge the divided components of the path, and finally use a scoring function to determine the truthfulness of unknown triplets. The overall process of the prediction block is shown in Algorithm 1.

    Algorithm 1 Prediction block
    Input: the path body $r_p$
    Output: the score of the triplet, $f(h,r,t)$
     1: Initialize the window size $w$
     2: for all $i = 1,2,\ldots,n-1$ do
     3:   get path segments with window sizes $w \in \{1,2,3\}$ and encode them with an LSTM: $[\hat{y}_i, \hat{y}_{i+1}] = \mathrm{LSTM}(w_i)$
     4:   $y_i = \hat{y}_{i+1}$
     5: end for
     6: $\mu = \mathrm{softmax}([\mathrm{FC}(y_1), \mathrm{FC}(y_2), \ldots, \mathrm{FC}(y_{n+1-w})])$
     7: $Y = \sum_{i=1}^{n+1-w} \mu_i y_i$
     8: $f(h,r,t) = \sigma(\mathrm{vec}([X_h; Y] * \omega)\, W \cdot X_t)$
     9: return $f(h,r,t)$

    Sliding Window Segmentation To extract fine-grained features from sampled paths, we decompose the sampled paths into combinations of different sizes using sliding windows of varying lengths. In the implementation, we use windows of size $w \in \{1,2,3\}$. Given the window size, the generated sliding windows traverse the path body $r_p = [r_{p_1}, \ldots, r_{p_n}]$. We then use a long short-term memory (LSTM) network as a sequence encoder to encode the information within each sliding window. Taking a sliding window of length 2 as an example,

    $[\hat{y}_i, \hat{y}_{i+1}] = \mathrm{LSTM}(w_i)$ (4.10)

    Since the final state $\hat{y}_{i+1}$ usually contains the complete information of the sequence, we take $y_i = \hat{y}_{i+1}$. If the relation segments in the $i$th sliding window always appear together in some combination, they are more likely to represent a real "long-distance" relation, and $y_i$ is then meaningful for learning that relation. To incorporate this observation into our model, we calculate a probability value for these relation segments by:

    $\mu = \mathrm{softmax}([\mathrm{FC}(y_1), \mathrm{FC}(y_2), \ldots, \mathrm{FC}(y_{n+1-w})])$ (4.11)

    where $\mathrm{FC}(\cdot)$ is a fully connected layer used to learn the probability that the $i$th window representation $y_i$ encodes a meaningful relation fragment. Finally, we take a weighted sum of the information from the different windows to represent the complete path feature:

    $Y = \sum_{i=1}^{n+1-w} \mu_i y_i$ (4.12)
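    A minimal sketch of this sliding-window path encoding (Eqs 4.10-4.12; ours, with a single window width $w$ and hypothetical dimensions, for illustration):

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode a relation path with sliding windows of width w: LSTM-encode
    each window, score the windows with an FC layer plus softmax, and return
    the weighted sum as the path embedding Y (Eqs 4.10-4.12)."""
    def __init__(self, d=200, w=2):
        super().__init__()
        self.w = w
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.fc = nn.Linear(d, 1)

    def forward(self, path):                      # path: (n, d) relation embeddings
        # all windows [r_i, ..., r_{i+w-1}] for i = 1 .. n+1-w
        windows = path.unfold(0, self.w, 1).transpose(1, 2)  # (n+1-w, w, d)
        out, _ = self.lstm(windows)
        y = out[:, -1, :]                         # y_i = final LSTM state per window
        mu = torch.softmax(self.fc(y), dim=0)     # window weights (Eq 4.11)
        return (mu * y).sum(dim=0)                # Y = sum_i mu_i y_i (Eq 4.12)
```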

    Scoring Function Considering the excellent performance of graph convolutional models in handling KGC problems, we choose the following scoring function:

    $f(h,r,t) = \sigma(\mathrm{vec}([X_h; Y] * \omega)\, W \cdot X_t)$ (4.13)

    In the proposed scoring function, $X_h$ and $X_t$ are the multi-modal embeddings of the head and tail entities, respectively, $Y$ is the embedding of their relationship, $*$ and $\omega$ denote the convolution operation and the convolution kernel, respectively, $\mathrm{vec}(\cdot)$ is the projection from the feature map to the vector space, and $W$ is a parameter matrix. With this function, we can compute whether a fact constructed from a certain relation between two entities is true.
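    A minimal sketch of such a convolutional scoring function, in the spirit of ConvE (ours; the reshape dimensions $d_m \times d_n$, the channel count, and the kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ConvScore(nn.Module):
    """Score f(h, r, t) in the style of Eq (4.13): stack the head and relation
    embeddings into a 2-D map, convolve with kernel ω, project the flattened
    feature map with W, and take the dot product with X_t."""
    def __init__(self, d=200, dm=10, dn=20, T=32, k=3):
        super().__init__()
        assert dm * dn == d
        self.dm, self.dn = dm, dn
        self.conv = nn.Conv2d(1, T, k, padding=k // 2)     # kernel ω, T channels
        self.W = nn.Linear(T * 2 * dm * dn, d)             # projection matrix W

    def forward(self, xh, y, xt):                          # each: (batch, d)
        stacked = torch.cat([xh.view(-1, 1, self.dm, self.dn),
                             y.view(-1, 1, self.dm, self.dn)], dim=2)  # [X_h; Y]
        feat = torch.relu(self.conv(stacked)).flatten(1)   # vec([X_h; Y] * ω)
        return torch.sigmoid((self.W(feat) * xt).sum(-1))  # σ(vec(...) W · X_t)
```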

    For ease of reference, we summarize the main symbols used in this section in Table 4.

    Table 4.  Notation summary.
    Notation | Explanation
    $X$ | Embedded entity vector
    $\hat{X}$ | Intermediate state of the entity embedding
    $\hat{y}$ | Intermediate state of the path embedding
    $y_i$ | Embedded path vector
    $Y$ | Encoded complete path vector
    $W$ | Parameter matrix


    We evaluate the effectiveness of the MLSFF model on two publicly available datasets: (ⅰ) FB15K-237-IMG, a subset of the large-scale knowledge graph Freebase in which each entity has 10 images; it is a commonly used dataset in KGC tasks; (ⅱ) WN18-IMG, an extension of WN18, a knowledge graph extracted from WordNet, in which each entity likewise has 10 images [52]. Both datasets are available as FB15k-WN18-images. Table 5 shows the statistics of the datasets.

    Table 5.  Statistics of datasets.
    Datasets | #Entities | #Relations | #Train | #Dev | #Test
    FB15k-237-IMG | 14,541 | 237 | 272,115 | 17,535 | 20,466
    WN18-IMG | 40,943 | 18 | 141,442 | 5,000 | 5,000


    Evaluation Metrics: We adopted classic knowledge graph completion evaluation metrics, including Hits@k and mean rank (MR), as shown in Table 6.

    Table 6.  Summarization of evaluation metrics.
    Evaluation metric | Calculation formula
    Hits@k | $\mathrm{Hits@}k = \frac{1}{|Q|}\sum_i \mathbb{1}(\mathrm{rank}_i \le k)$
    MR | $\mathrm{MR} = \frac{1}{|Q|}\sum_i \mathrm{rank}_i$


    Hits@k: The Hits@k metric is defined as the proportion of true entities that appear in the top-k ranked list of entities. It is calculated as follows:

    $\mathrm{Hits@}k = \frac{\sum_i \mathbb{1}(\mathrm{rank}_i \le k)}{|Q|}$ (5.1)

    where $\mathrm{rank}_i$ denotes the rank of the expected entity for the $i$th incomplete fact triple and $|Q|$ is the total number of incomplete fact triples.

    Mean Rank (MR): Mean Rank is the arithmetic average of the individual entity ranks, defined as:

    $\mathrm{MR} = \frac{1}{|Q|}\sum_i \mathrm{rank}_i$ (5.2)
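    A minimal sketch computing both metrics from a list of ranks (ours, for illustration):

```python
def hits_at_k(ranks, k):
    """Proportion of queries whose true entity ranks in the top k (Eq 5.1)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Arithmetic mean of the true-entity ranks (Eq 5.2)."""
    return sum(ranks) / len(ranks)

# Example: ranks of the true entity over |Q| = 5 incomplete triples
ranks = [1, 3, 12, 2, 40]
print(hits_at_k(ranks, 10), mean_rank(ranks))   # 0.6 11.6
```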

    Parameter Configuration To balance model scale and computational efficiency, we choose the ViT-B/16 pre-trained model for the image encoder. We set the embedding dimension for both text and image to 768. The number of layers for both the image and text encoders is set to 12, while the number of layers for the modality encoder is set to 3. The graph embedding dimension is set to 200, and the batch size to 64. We use warmup together with the Adam optimizer to adjust the learning rate of the model parameters. The initial learning rate is set to 0.0005, and the dropout rate to 0.1.
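    A minimal sketch of this training configuration (ours; the linear warmup schedule and its step count are assumptions, since only the use of warmup is stated):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 200)       # stand-in for the full MLSFF network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # initial lr 0.0005

# Linear warmup over the first `warmup_steps` steps, then a constant rate
warmup_steps = 1000               # assumed; the exact schedule is not specified
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
dropout = nn.Dropout(p=0.1)       # dropout rate from the configuration above
```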

    Baseline Setup We selected four unimodal methods and four multi-modal methods as baselines for comparison with our proposed model. The unimodal methods are: 1) TransE [26], a classic translation-based model that encodes entities and relations into a linear space; 2) DistMult [53], which uses a linear neural network to encode a multi-relation graph for multi-relation learning; 3) ComplEx [54], which handles both symmetric and asymmetric relations by introducing complex-valued embeddings; and 4) RotatE [55], which defines relations as rotations from the head entity to the tail entity in complex space to achieve multi-class reasoning. The multi-modal methods are: (ⅰ) IKRL (UNION) [44], which extends TransE to learn visual representations of entities and structural features of KGs; (ⅱ) TransAE [56], which combines multi-modal encoders with TransE to achieve a unified representation of visual and textual features; (ⅲ) RSME [57], which uses a forget gate to select valuable images for MKG embedding; and (ⅳ) MKGformer [52], which proposes an MKG pre-training model based on a hybrid Transformer structure.

    The experimental results on the two datasets are shown in Table 7, which shows that our model generally outperforms the 8 baseline methods.

    Table 7.  Results of link prediction on FB15k-237-IMG and WN18-IMG.
    Model FB15k-237-IMG WN18-IMG
    Hits@1↑ Hits@3↑ Hits@10↑ MR↓ Hits@1↑ Hits@3↑ Hits@10↑ MR↓
    TransE 0.198 0.376 0.441 323 0.40 0.745 0.923 357
    DistMult 0.199 0.301 0.466 512 0.335 0.876 0.940 655
    ComplEx 0.194 0.297 0.450 546 0.936 0.945 0.947 -
    RotatE 0.241 0.375 0.533 177 0.942 0.950 0.957 254
    IKRL (UNION) 0.194 0.284 0.458 298 0.127 0.796 0.928 596
    TransAE 0.199 0.317 0.463 431 0.323 0.835 0.934 352
    RSME 0.242 0.344 0.467 417 0.943 0.951 0.957 223
    MKGformer 0.256 0.367 0.504 221 0.944 0.961 0.972 28
    MLSFF (ours) 0.274 0.411 0.552 193 0.951 0.973 0.980 22

     | Show Table
    DownLoad: CSV

    Firstly, across all methods, the scores on FB15k-237-IMG are generally lower than those on WN18-IMG. The fundamental reason is that FB15k-237-IMG is sparser and more complex than WN18-IMG, with a greater variety of relations between entities. In addition, our model performs particularly strongly on Hits@1, indicating a superior discriminative ability in predicting unknown entities. In the MLSFF model, we use two single-modal encoders to extract image and text information, followed by a multi-modal layer for interaction, which enables full learning of the semantic information in entity descriptions. We introduce a sliding window when learning link features, which realizes "scalable" path sampling and, to some extent, alleviates the problem of complex graph structures.

    Secondly, some traditional single-modal methods, such as RotatE, even outperform architectures that use multi-modal features in overall performance. This suggests that well-designed relation decomposition and learning rules are effective for solving complex graph problems, and that fully exploiting structural features can improve prediction accuracy. Therefore, after obtaining the multi-modal encoding, our model not only uses traditional graph convolution to gather neighbor-node information, but also incorporates long-distance path features, borrowing from the recurrent neural network structures used in text processing to extract information from nodes on either side of a selected path. By adding such "vertical" features during the convolution process, our prediction model gains better interpretability.

    Finally, our model achieved improvements of 4.8% and 1.2% on the two datasets, respectively. However, on FB15k-237-IMG, our model's MR was slightly inferior to that of RotatE. This could be attributed to FB15k-237-IMG containing a larger number of entities and a more diverse set of relations, resulting in a sparser and more complex knowledge graph. While our model improves the learning of multi-hop path relations to some extent, it lacks the treatment of negative samples seen in RotatE, which affects the overall accuracy. Overall, the experimental results demonstrate that our model outperforms existing methods on most evaluation metrics, with even larger improvements on more complex knowledge graphs. This is because the MLSFF model learns more comprehensive semantic features by fusing information from both the image and text modalities, enabling more comprehensive knowledge extraction from the graph. In addition, we employ convolutional operations that capture neighborhood information and an LSTM structure that learns path-level features, yielding a more comprehensive, multi-level feature encoding structure for graph structural features, which is highly effective for processing large-scale knowledge graphs.

    To investigate the actual effects of each component in the MLSFF model, we conducted ablation studies by removing some of the components.

    w/o SinE: To investigate the effect of the single-modal encoders on understanding image and text semantics, we aligned the one-dimensional vectorized image patches and text embeddings, calculated their Hadamard product, and directly fed them into the multi-modal encoder for learning.

    w/o Flt: To further investigate the actual effect of the irrelevance filtering layer, we also tested the multi-modal fusion module by directly fusing the encoded image and text features without the filtering layer.

    w/o Swin: To demonstrate the positive effect of extracting path information on learning graph structure features, we removed the sliding window encoding module and only used graph convolution operations to obtain structural embeddings.

    From Figure 4, it can be seen that using single-modality encoders to extract image and text features can effectively enhance semantic understanding and better learn human knowledge, thereby promoting and improving the performance in KGC tasks. Although image features can assist in text understanding, there is still some noise interference. Filtering out irrelevant information can further enhance the fusion effect between multi-modal features and improve accuracy. In addition, when facing large-scale and complex knowledge graphs, although graph convolutional operations can already fully learn structural information and capture neighbor features, the introduction of path and rule features can further improve model interpretability and prediction ability. Specifically, when dealing with sparse graphs, simple convolutional operations may lead to a certain decrease in accuracy, and learning path features can also help improve model efficiency.

    Figure 4.  Ablation on different components of the MLSFF.

    Our link prediction module is mainly implemented with a GNN algorithm, which aggregates neighbor information into the target node and then updates the target node based on the integrated information. However, this approach is prone to over-smoothing: the representations of different nodes tend to become similar as the number of GNN layers increases during training. To address this issue, we introduce "longer-distance" path embeddings, which combine depth and breadth features to extract complex graph structure information.

    We further explore effective graph processing structures by adjusting the number of convolutional layers and the size of the sliding window. In this work, considering memory and computational capacity, we conduct experiments with sliding window widths ranging from 1 to 3. As shown in Figure 5, the model performs better when the sliding window width is set to 2.

    Figure 5.  Impacts of the width on FB15k-237 and WN18RR.

    When the sliding window width is set to 2, our model can learn more layers of graph structural features and neighbor information. When the sliding window width is too small, that is, when the number of subgraphs learned is too few, the information in the knowledge graph cannot be fully aggregated to learn the structural information of the knowledge graph. In addition, some useful high-order neighbors cannot be captured. When the number of subgraphs is too large, the node representation is overly smoothed due to excessive noise.

    MLSFF: Denote the entity embedding dimension by $d_e$, the structural embedding dimension by $d_r$, and the number of channels by $T$, and let the final output dimension for triplet encoding be $m \times n$. The main complexity of our model is $O(|E|d_e + |R|d_r + Tmn + Td(2d_m - m + 1)(d_n - n + 1))$.

    TransE: The scoring function of the TransE model is $\|h + r - t\|$; consequently, its algorithmic complexity is $O(|E|d + |R|d)$.

    RED-GNN: As a GNN model for the traditional knowledge graph completion task, RED-GNN has an algorithmic complexity of $O(d \min(\bar{D}^L, |F|L))$, where $\bar{D}$ is the average degree of the $r$-directed graph per layer. It can be observed that our model has a slightly higher computational complexity. This is attributable to two main factors: first, the inherent complexity of multimodal knowledge graphs; and second, the decision to incorporate a more extensive graph feature learning scheme to enhance the interpretability of paths.

    Despite the promising results and contributions of our study, there are some limitations that should be acknowledged:

    While our model aims to enhance interpretability by incorporating graph features and multi-hop paths, the interpretability of the model's predictions may still be limited. Explaining the reasoning behind specific predictions or understanding the underlying decision-making processes can be challenging, especially in complex multimodal knowledge graphs.

    In addition, the proposed model in this paper exhibits high complexity, which results in increased demands for computational resources and significant time consumption. Furthermore, our model does not consider the possibility of negative samples during the sampling process, which has an impact on the overall accuracy of the prediction task.

    We propose the MLSFF model, which first uses two independent single-modality encoders to obtain pre-trained embeddings for image and text information. Then, after filtering out irrelevant information, the multi-modal features are fused into a unified encoding vector. We utilize graph convolution to learn the structural information of the knowledge graph and introduce path-based features into the graph structural features to obtain richer relation representations. Our experimental results demonstrate that our model achieves better performance on MKGC tasks. To address the high complexity and the omission of negative samples in our model, future research will focus on: (ⅰ) designing simpler, more streamlined, and more computationally efficient scoring functions; (ⅱ) accounting for negative-sample interference, thereby mitigating its impact on the accuracy of the prediction task; and (ⅲ) incorporating additional modalities, such as numerical features, to achieve a more comprehensive and diverse multimodal fusion and enhance the overall performance of the model.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by the National Natural Science Foundation of China-China State Railway Group Co., Ltd. Railway Basic Research Joint Fund (Grant No.U2268217) and the Scientific Funding for China Academy of Railway Sciences Corporation Limited (No.2021YJ183).

    The authors declare there is no conflict of interest.


