
    The continuous development of deep learning technology has had a significant impact on research in many fields. For instance, in biomedicine, deep learning based automatic diagnostic techniques have emerged, enabling image recognition and assisting healthcare professionals in diagnosis and subsequent procedures [1,2,3,4,5]. Deep learning has also demonstrated superior performance in scenarios with larger datasets, such as multi-view clustering [6,7,8,9,10]. Knowledge graphs (KGs) have emerged as a way to store, and learn from, such vast amounts of information.

    A knowledge graph can be conceptualized as a large-scale semantic network that represents entities as nodes and relationships as directed edges; it thus stores a vast amount of human knowledge in the form of a directed graph. The resource description framework (RDF) provides a standard framework for KG representation, wherein fact triples (head, relationship, tail) are employed to describe knowledge [11]. A KG can store rich information about real-world entities and their relationships and can support a range of reasoning processes across the graph. Compared to traditional structured data, this graph-based approach to data processing has demonstrated superior performance in tasks such as information retrieval, question-answering systems, and recommendation systems [12,13]. However, because real-world knowledge is unbounded and constantly evolving, KGs are inherently incomplete, which motivates the task of knowledge graph completion (KGC).

    In the field of natural language processing (NLP), KGC techniques can be broadly categorized into three types: rule-based models, path-based models, and embedding-based models. Rule-based models tend to retain the original semantic information more completely and therefore offer better interpretability. Path-based models make better use of the graph structure, enabling guided reasoning through various path-searching mechanisms. Both approaches are more interpretable, though their expressiveness is limited by model constraints and their time and space complexity is higher. Compared to these two types, embedding-based models typically offer greater expressiveness. With the development of graph neural networks (GNNs), GNN-based models have shown great potential in various graph-based tasks, providing additional ideas for KGC. In recent years, KGs have also been studied in computer vision, for example in scene graphs and in the integration of language and images.

    In recent years, multi-modal knowledge graphs (MKG) have gained significant attention as an extension to traditional knowledge graphs based on a single modality. MKGs typically augment semantic KGs with additional modality data, such as visual and audio attributes, to provide more physically rich representations of the world [14,15,16], as illustrated in Figure 1. For a given entity in the knowledge graph, we can use both image and text descriptions to supplement more detailed information that cannot be captured solely by the graph structure. Unfortunately, due to the lack of accumulated multi-modal corpora, existing MKGs often suffer from more severe incompleteness compared to traditional KGs, which greatly reduces their utility and effectiveness. In the task of multi-modal knowledge graph completion (MKGC), we must consider both the issues of multi-modal information fusion and the accuracy and interpretability of knowledge graph completion. In terms of multi-modal information fusion, we need to address issues such as semantic alignment, noise reduction or attenuation, and the realization of unified embeddings. In the process of link prediction, we must not only leverage the semantic richness of multi-modal information to improve accuracy but also enhance the logicality of the algorithm and improve its interpretability [17,18].

    Figure 1.  A simple multi-modal knowledge graph example.

    Despite the abundance of existing image-text embedding pre-training models, these models often focus on single pairs of corresponding images and text and fail to consider the distinctive structural features of KGs. Therefore, our research builds upon MKGs that contain image-text feature information. In addition to integrating embeddings from different modalities, we also retain local graph features and introduce path features to enhance the interpretability of the reasoning model. Specifically, we propose a method that first utilizes separate modality encoders to learn image and text embeddings, followed by an irrelevance filtering layer that selects semantically relevant key features. Next, we fuse and encode information from the different modalities to obtain a multi-modal representation. We then use graph convolution and path features to extract structural features, and use a scoring function to predict missing triples. Our contributions can be summarized as follows:

    1) We designed a structure that extracts image-text information through single-modality encoding followed by interactive fusion, and improved semantic similarity through an irrelevance filtering module, thereby enhancing the fused understanding of the different modalities;

    2) We proposed a structural feature learning scheme that combines graph convolution and path embedding, thereby enhancing interpretability during the reasoning process;

    3) We achieved better results on two public datasets, FB15K-237-IMG and WN18-IMG.

    The task of knowledge graph completion has been widely studied, with typical sub-tasks including link prediction, entity prediction, and relation prediction, all aimed at predicting missing triples (head, relation, tail) in the knowledge graph. Rule-based models such as AMIE and RLvLR utilize symbolic features to perform reasoning through rule mining or rule searching algorithms [19,20]. NeuralLP introduced dynamic programming and further optimized rule mining through attention mechanisms and auxiliary memory [21]. Path-based models focus more on the paths between queried head and tail entities; algorithms such as the path ranking algorithm (PRA) and random walks have been applied and further explored in such models. RNNPRA uses recurrent neural networks (RNNs) to better learn path features for reasoning tasks [22]. DIVA proposed a unified reasoning framework that divides multi-hop reasoning into path search and path inference steps [23]. The continuous development of deep reinforcement learning (DRL) techniques has enabled more effective multi-hop reasoning in sparse graphs; models such as DeepPath and MultiHop achieve more effective path exploration by designing new reward mechanisms [24,25].

    Currently, the mainstream methods for solving KGC problems are embedding-based models. Translation-based models such as TransE, TransR, and TransH embed entities and relations by projection and use a distance function to score factual triplets [26,27,28]. Tensor factorization models such as RESCAL, Tucker, and LowFER use vectors to capture latent semantics through tensor decomposition, continuously improving model efficiency while reducing model size [29,30,31]. With the continuous improvement of neural networks (NNs) in learning and expressing knowledge, further embedding-based models use neural network architectures to implement KGC. NTN uses neural tensor networks for relation reasoning in KGs [32]. ConvE learns deeper features using two-dimensional convolutional layers [33]. InteractE processes more complex semantic information and KG interactions through operations such as feature reshaping, feature permutation, and circular convolution [34]. Although CNN-based KGC models generally perform better than traditional NN models, the feature information contained in the graph structure itself has not been well utilized. Therefore, GNNs have been introduced into the KGC field to perform more complex reasoning tasks based on graph structure features. RGCN encodes each entity into a vector, uses relation-specific transformations to aggregate neighborhood information, and then reproduces facts through a decoder [35]. SACN uses weighted graph convolutional networks (WGCN) as the encoder and feeds the encoded information into a convolutional network for decoding [36]. NBF-Net and RED-GNN improve on traditional algorithms, adopting the Bellman-Ford algorithm and dynamic programming, respectively, to optimize the propagation strategy of previous GNN models and achieve efficiency gains [37,38].

    The traditional tasks in the two major fields of computer vision (CV) and natural language processing (NLP) have been extensively studied, and more recent research has focused on cross-modal problems. The optimization and development of the Transformer model has led to a series of explorations into visual-text pre-training frameworks. VisualBERT, considered the first image-text pre-training model, uses Faster R-CNN to extract visual features, concatenates them with text embeddings, and feeds them into a Transformer initialized from BERT [39]. Inspired by the feature extraction and architecture of VisualBERT, more pre-training models have been proposed by adjusting the pre-training tasks and datasets. CLIP uses a dataset of 400 million image-text pairs for pre-training, learning representations by directly matching raw text with corresponding images [40]. METER further explores single-modal feature extraction and handles multi-modal fusion using a dual-stream architecture, achieving excellent performance on many downstream tasks [41].

    Numerous multi-modal pre-training models have adopted masked language modeling (MLM), masked visual modeling (MVM), and visual-linguistic matching (VLM) as pre-training objectives; their corresponding downstream tasks mainly deal with the meaning of, and relationships between, text and images, such as visual question answering (VQA), visual commonsense reasoning (VCR), and visual captioning (VC). For KGs, however, the feature distinguishing them from purely semantic structured information is their graph structure. Recently, some studies have recognized the importance of structural features for handling KG-related tasks. DRAGON proposes a deep bidirectional, self-supervised method for pre-training language-knowledge models from text and KGs [42]. Knowledge-CLIP takes entities and relations in KGs as inputs and extracts their original features [43]. Entities can be in the form of images or text, while relations are described using language tokens. These pre-training models with structural features provide better options for MKG-related tasks.

    As an emerging research field, work on MKGC is not yet systematic; early MKGC approaches often directly added image information to the input of an original KGC model, which usually led to suboptimal performance. To address this issue, many studies have made further attempts and explorations in image-text feature fusion for MKGs.

    IKRL first proposed an attention-based neural network to consider the visual information in entity images [44]. TransAE introduced a KG representation learning method that integrates multi-channel (visual and language) information in a translation-based framework, and extended the definition of triple energy to accommodate the new multi-channel representations [45]. MKBE and MRCGN integrate different neural encoders and decoders with relational models to learn embeddings of multi-modal data for inference [14,46]. MarT constructed a multi-channel analogical reasoning framework based on structure-mapping theory to improve model interpretability [47]. MMKGR uses a unified gate-attention network to perform attention interaction and filter noise, generating more effective and reliable multi-modal complementary feature encodings, and designs a new reinforcement learning framework to predict missing elements in multi-hop reasoning [16]. MM-RNS proposed a multi-channel relation-enhanced negative sampling framework that provides bidirectional attention between visual and textual features by integrating relation embeddings, combined with contrastive learning to construct an effective contrastive semantic sampler and improve MKGC performance [48].

    We have conducted a brief overview of the related models in traditional and multimodal KGs, as shown in Table 1.

    Table 1.  Summarization of existing KGC models.
    Category | Knowledge graph completion | Multi-modal knowledge graph completion
    Rule-based models | AMIE, RLvLR, NeuralLP | -
    Path-based models | RNNPRA, DIVA, DeepPath, MultiHop | -
    Embedding-based models (translational) | TransE, TransR, TransH | TransAE
    Embedding-based models (tensor decompositional) | RESCAL, Tucker, LowFER | -
    Embedding-based models (neural network) | NTN, ConvE, InteractE, RGCN, SACN, NBF-Net, RED-GNN | IKRL, MKBE, MRCGN, MarT, MMKGR, MM-RNS


    To demonstrate the effectiveness of the aforementioned work more clearly, Table 2 presents a more detailed comparison of selected algorithms.

    Table 2.  Model performance comparison.
    Model | Dataset | Technique | Performance (Hits@10, %)
    RLvLR | FB75K | Logic rules | 43.4
    MultiHop | FB15k-237 | Relation paths | 56.4
    TransE | FB15k-237 | Translational | 47.1
    LowFER | FB15k-237 | Tensor decompositional | 54.4
    RED-GNN | FB15k-237 | GNN | 55.8
    TransAE | WN9-IMG | Translational | 94.84
    MMKGR | WN9-IMG | Attention | 92.8


    The knowledge graph $G=\{E,R,F\}$ is a directed graph, where $E$ is the entity set, $R$ is the relation set, and $F=\{(h,r,t)\mid h\in E,\, t\in E,\, r\in R\}$ is the fact set consisting of fact triples $(h,r,t)$. The head entity $h\in E$ and the tail entity $t\in E$ are connected by a relation $r\in R$. For a multi-modal knowledge graph $G$, each entity $e$ includes two modalities, namely textual information $e^t$ and visual information $e^v$.

    The purpose of multi-modal KGC is to infer incomplete triplets $T=\{(h,r,t)\mid h\in E,\, t\in E,\, r\in R,\, (h,r,t)\notin F\}$ based on known fact triplets $(h,r,t)$. In practice, the incomplete triplets in our prediction task can take three forms, namely $(h,r,?)$, $(h,?,t)$, and $(?,r,t)$. In the implementation, we input the feature information of entities $e$ and relations $r$ into an encoder to obtain the corresponding embedding vectors $h$, $r$, $t$. Then, we use a scoring function $f(h,r,t)$ to evaluate the probability that an inferred triplet is true: when $(h,r,t)\in G$ holds, $f(h,r,t)$ scores 1; otherwise, when $(h,r,t)\notin G$, $f(h,r,t)$ scores 0. Taking a missing triplet of the form $(h,?,t)$ as an example, assume the existence of a relation $r_{pd}$ between the head entity $h$ and the tail entity $t$, giving the complete triplet $(h,r_{pd},t)$ of unknown truthfulness. To evaluate the probability of its actual occurrence, we apply the scoring function, obtaining $f(h,r_{pd},t)$. The basic terminology definitions are shown in Table 3, and a minimal ranking sketch follows the table.

    Table 3.  Notation summary.
    Notation | Explanation
    $G$ | Multi-modal knowledge graph
    $E$ | Entity set
    $R$ | Relation set
    $F$ | Fact set
    $T$ | Incomplete fact set
    $(h,r,t)$ | Fact triplet of head, relation, tail
    $h$ | Embedding of the head entity
    $t$ | Embedding of the tail entity
    $r$ | Embedding of the relation

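    To make the ranking procedure concrete, the following minimal sketch (ours, for illustration; `score_fn` and the embedding tables are hypothetical stand-ins) shows how a tail query $(h,r,?)$ is answered by scoring every candidate entity and ranking the ground truth:

```python
import torch

def rank_tail(score_fn, h_emb, r_emb, entity_embs, true_tail_idx):
    """Answer a query (h, r, ?) by scoring every candidate tail entity.

    `score_fn(h, r, t)` is any scoring function that returns a higher
    value for more plausible triples; `entity_embs` is an (|E|, d) table.
    Returns the rank of the ground-truth tail (1 = best)."""
    scores = torch.stack([score_fn(h_emb, r_emb, t) for t in entity_embs])
    # rank = 1 + number of candidates scored strictly higher than the truth
    return 1 + (scores > scores[true_tail_idx]).sum().item()
```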

    The model we proposed, MLSFF, has an overall architecture shown in Figure 2, which consists of three components: 1) single-modality encoders for image and text embedding; 2) a multi-modal feature fusion mechanism with irrelevant filtering to discard interfering information and to reduce noise when the image and text features interact with each other; 3) a reasoning framework that combines the graph structure and path features, introduces a new scoring function containing multi-hop path features, and uses multi-modal features to predict incomplete triplets in KGC processes.

    Figure 2.  Overview of our model structure.

    The emergence of the Transformer model caused a revolution in the NLP field, and Transformers are now widely used across tasks. Introducing the Transformer into the CV field has likewise achieved remarkable results: when the pre-training data is large enough, Transformers significantly outperform CNNs in CV, overcoming the limitation of having few inductive biases and transferring better to downstream tasks. We use independent image and text encoders based on the Transformer architecture to extract features from the raw inputs. For a given triple, the entity and relation are sent to the corresponding encoder based on their modality (image or text); a relation, represented by language tokens, is sent to the text encoder in the same way as a text entity. The main architecture of our single-modality encoder is illustrated in Figure 3.

    Figure 3.  Structure of single modal encoder.

    Visual Encoder For image feature extraction, we adopt the embedding layer and Transformer encoder of the pre-trained ViT model as the main architecture [49]. Let $C$ be the number of channels in the image (for RGB images, $C=3$) and let the resolution of each image patch be $(P,P)$. First, we scale the input image $I$ to a unified resolution $(A,B)$ and divide it into $N = AB/P^2$ patches. We use a linear mapping (i.e., an FC layer) to transform each patch into a one-dimensional vector, yielding the patch embedding $X^v_{pat}$ of the original image. Subsequently, we feed the image embedding together with the position embedding $X^v_{pos}$ into the Transformer encoder. The overall forward computation is as follows:

    $X^v_0 = X^v_{pat} + X^v_{pos}$ (4.1)
    $\hat{X}^v_l = \mathrm{MSA}(\mathrm{LN}(X^v_{l-1})) + X^v_{l-1}, \quad l = 1,2,\ldots,L_v$ (4.2)
    $X^v_l = \mathrm{FFN}(\mathrm{LN}(\hat{X}^v_l)) + \hat{X}^v_l, \quad l = 1,2,\ldots,L_v$ (4.3)

    The MSA block consists of a multi-head self-attention mechanism, layer normalization, and a skip connection (LayerNorm & Add), repeated $L_v$ times; the output of the $l$th block is $\hat{X}^v_l$. The MLP block consists of a feed-forward network, layer normalization, and a skip connection (LayerNorm & Add), also repeated $L_v$ times; the output of the $l$th block is $X^v_l$.
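    As a minimal sketch of the patch-embedding step in Eq (4.1) (ours; the default resolution, patch size, and dimension follow common ViT-B/16 conventions and are assumptions here):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, linearly project each patch with
    an FC layer, and add a learnable position embedding (Eq 4.1)."""
    def __init__(self, A=224, B=224, P=16, C=3, d=768):
        super().__init__()
        self.P = P
        self.N = (A * B) // (P * P)                          # N = AB / P^2 patches
        self.proj = nn.Linear(C * P * P, d)                  # per-patch linear mapping
        self.pos = nn.Parameter(torch.zeros(1, self.N, d))   # position embedding X_pos

    def forward(self, img):                                  # img: (batch, C, A, B)
        # unfold into flattened P x P patches: (batch, N, C*P*P)
        patches = nn.functional.unfold(img, self.P, stride=self.P).transpose(1, 2)
        return self.proj(patches) + self.pos                 # X_0 = X_pat + X_pos
```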

    Textual Encoder In NLP tasks, a large number of pre-training models based on the Transformer architecture have emerged, such as BERT, which has been widely applied and has demonstrated great success in various downstream tasks [50,51]. In this paper, we use BERT to perform language modeling and feature extraction. Specifically, we split the complete sentence into a word sequence and perform word embedding to obtain the word embeddings $X^t_{word}$. To preserve sentence-level features, we also embed the entire sentence and align it with the word embeddings to obtain the sentence embeddings $X^t_{sen}$. Then, we send the word embeddings $X^t_{word}$, position embeddings $X^t_{pos}$, and sentence embeddings $X^t_{sen}$ to the encoder.

    $X^t_0 = X^t_{word} + X^t_{sen} + X^t_{pos}$ (4.4)
    $\hat{X}^t_l = \mathrm{LN}(\mathrm{MSA}(X^t_{l-1})) + X^t_{l-1}, \quad l = 1,2,\ldots,L_t$ (4.5)
    $X^t_l = \mathrm{LN}(\mathrm{FFN}(\hat{X}^t_l)) + \hat{X}^t_l, \quad l = 1,2,\ldots,L_t$ (4.6)

    The difference between the textual and visual encoders is that layer normalization (LN) is applied after the multi-head self-attention (MSA) and feed-forward network (FFN) layers. As before, the output of the $l$th MSA block is denoted $\hat{X}^t_l$ and the output of the $l$th MLP block is denoted $X^t_l$; the number of MSA and MLP blocks in the text encoder is $L_t$.
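    The contrast amounts to where LN sits in each residual branch. A minimal sketch (ours; the dimension and head count are placeholder assumptions) of the pre-norm visual block (Eqs 4.2 and 4.3) versus the post-norm textual block (Eqs 4.5 and 4.6):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Visual encoder block: LN before MSA/FFN (Eqs 4.2-4.3)."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.msa(h, h, h)[0]        # X̂_l = MSA(LN(X_{l-1})) + X_{l-1}
        return x + self.ffn(self.ln2(x))    # X_l = FFN(LN(X̂_l)) + X̂_l

class PostNormBlock(nn.Module):
    """Textual encoder block: LN after MSA/FFN (Eqs 4.5-4.6)."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        x = self.ln1(self.msa(x, x, x)[0]) + x   # X̂_l = LN(MSA(X_{l-1})) + X_{l-1}
        return self.ln2(self.ffn(x)) + x         # X_l = LN(FFN(X̂_l)) + X̂_l
```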

    In the multimodal fusion module, we fuse the separately encoded text and image information. Relations form a separate data category carrying label information: although they are usually described in text, their semantic relevance to the textual and visual descriptions of entities is relatively low. Therefore, we fuse and filter the image and text information separately from the relations, and introduce the encoded relation attributes later, when learning the path features.

    To enhance the efficiency of the semantic interaction between the two different modalities of image and text, we adopt an intermediate representation to unify the multimodal information. On one hand, we aim to achieve a more fine-grained interaction between different modal feature information; on the other hand, since images often contain semantically irrelevant information, directly using the complete image embedding in the feature fusion process may introduce noise. Therefore, we feed the learned image and text vectors into a multimodal gated unit for weight learning to achieve the intermediate feature representation.

    $g_f = \sigma(X^v W^v \odot X^t W^t)$ (4.7)
    $\hat{X}^m = g_f X^v + (1-g_f) X^t$ (4.8)

    In these equations, $\sigma$ is the sigmoid function, $X^v$ and $X^t$ are the feature vectors output by the image and text encoders, respectively, $W^v$ and $W^t$ are parameter matrices, $g_f$ is a scalar in the range $[0,1]$, $\hat{X}^m$ is the multi-modal embedding vector obtained through the filtering layer, and $\odot$ denotes element-wise multiplication (the Hadamard product).
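    A minimal sketch of this gated filtering unit (ours; projecting each modality to a single logit is one plausible way to realize the scalar gate $g_f$ described above):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Irrelevance-filtering gate (Eqs 4.7 and 4.8): learn a scalar weight
    g_f in [0, 1] per sample and mix image and text features accordingly."""
    def __init__(self, d=768):
        super().__init__()
        self.Wv = nn.Linear(d, 1, bias=False)   # parameter matrix W^v
        self.Wt = nn.Linear(d, 1, bias=False)   # parameter matrix W^t

    def forward(self, xv, xt):                  # xv, xt: (batch, d)
        # g_f = sigma(X^v W^v ⊙ X^t W^t); here each projection yields one logit
        g = torch.sigmoid(self.Wv(xv) * self.Wt(xt))
        return g * xv + (1 - g) * xt            # X̂^m = g_f X^v + (1 - g_f) X^t
```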

    We then feed the fused embedding $\hat{X}^m$ into the multi-modal encoder to further learn its semantic features.

    $X = \mathrm{Tran}(\hat{X}^m)$ (4.9)

    Through the preceding structure we obtain the multi-modal feature embedding of a given fact description, but this is insufficient for large-scale and complex KGs. Hence, we further learn path features to better accomplish the KGC task. The overall approach to learning structural features and completing the graph can be summarized as follows. First, we extract a path from the MKG, connect the relations along the path, and divide it into several shorter components with a sliding window. Then, we select a component and use a recurrent attention unit to embed it, obtaining a relation vector represented as a weighted combination of existing relations. We recursively merge the divided components of the path, and finally use a scoring function to determine the truthfulness of unknown triplets. The overall process of the prediction block is shown in Algorithm 1.

    Algorithm 1 Prediction block
    Input: the path body $r_p$
    Output: the score of the triplet, $f(h,r,t)$
     1: Initialize the window size $w$
     2: for all $i = 1,2,\ldots,n-1$ do
     3:   get path segments with window sizes $w \in \{1,2,3\}$ and encode them with an LSTM: $[\hat{y}_i, \hat{y}_{i+1}] = \mathrm{LSTM}(w_i)$
     4:   $y_i = \hat{y}_{i+1}$
     5: end for
     6: $\mu = \mathrm{softmax}([\mathrm{FC}(y_1), \mathrm{FC}(y_2), \ldots, \mathrm{FC}(y_{n+1-w})])$
     7: $Y = \sum_{i=1}^{n+1-w} \mu_i y_i$
     8: $f(h,r,t) = \sigma(\mathrm{vec}([X_h; Y] * \omega)\, W \cdot X_t)$
     9: return $f(h,r,t)$

    Sliding Window Segmentation To extract fine-grained features from sampled paths, we decompose the sampled paths into combinations of different sizes using sliding windows of varying lengths. In the implementation, we use windows of size $w \in \{1,2,3\}$. Given the window size, the generated sliding windows traverse the path body $r_p = [r_{p_1}, \ldots, r_{p_n}]$. We then use a long short-term memory (LSTM) network as a sequence encoder to encode the information within each sliding window. Taking a sliding window of length 2 as an example,

    $[\hat{y}_i, \hat{y}_{i+1}] = \mathrm{LSTM}(w_i)$ (4.10)

    Since the final state $\hat{y}_{i+1}$ usually contains the complete information of the sequence, we take $y_i = \hat{y}_{i+1}$. If the relation segments in the $i$th sliding window always appear together in some combination, they are more likely to represent a real "long-distance" relation, and $y_i$ is then meaningful for learning that relation. To incorporate this observation into our model, we calculate a probability value for these relation segments by:

    $\mu = \mathrm{softmax}([\mathrm{FC}(y_1), \mathrm{FC}(y_2), \ldots, \mathrm{FC}(y_{n+1-w})])$ (4.11)

    where $\mathrm{FC}(\cdot)$ is a fully connected layer used to learn the probability that the $i$th window representation $y_i$ encodes a meaningful relation fragment. Finally, we take a weighted sum of the information from the different windows to represent the complete path feature:

    $Y = \sum_{i=1}^{n+1-w} \mu_i y_i$ (4.12)
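    A minimal sketch of this sliding-window path encoding (Eqs 4.10-4.12; ours, with a single window width $w$ and hypothetical dimensions, for illustration):

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode a relation path with sliding windows of width w: LSTM-encode
    each window, score the windows with an FC layer plus softmax, and return
    the weighted sum as the path embedding Y (Eqs 4.10-4.12)."""
    def __init__(self, d=200, w=2):
        super().__init__()
        self.w = w
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.fc = nn.Linear(d, 1)

    def forward(self, path):                      # path: (n, d) relation embeddings
        # all windows [r_i, ..., r_{i+w-1}] for i = 1 .. n+1-w
        windows = path.unfold(0, self.w, 1).transpose(1, 2)  # (n+1-w, w, d)
        out, _ = self.lstm(windows)
        y = out[:, -1, :]                         # y_i = final LSTM state per window
        mu = torch.softmax(self.fc(y), dim=0)     # window weights (Eq 4.11)
        return (mu * y).sum(dim=0)                # Y = sum_i mu_i y_i (Eq 4.12)
```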

    Scoring Function Considering the excellent performance of graph convolutional models in handling KGC problems, we choose the following scoring function:

    $f(h,r,t) = \sigma(\mathrm{vec}([X_h; Y] * \omega)\, W \cdot X_t)$ (4.13)

    In the proposed scoring function, $X_h$ and $X_t$ are the multi-modal embeddings of the head and tail entities, respectively, $Y$ is the embedding of their relationship, $*$ and $\omega$ denote the convolution operation and the convolution kernel, respectively, $\mathrm{vec}(\cdot)$ is the projection from the feature map to the vector space, and $W$ is a parameter matrix. With this function, we can compute whether a fact constructed from a certain relation between two entities is true.
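    A minimal sketch of such a convolutional scoring function, in the spirit of ConvE (ours; the reshape dimensions $d_m \times d_n$, the channel count, and the kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ConvScore(nn.Module):
    """Score f(h, r, t) in the style of Eq (4.13): stack the head and relation
    embeddings into a 2-D map, convolve with kernel ω, project the flattened
    feature map with W, and take the dot product with X_t."""
    def __init__(self, d=200, dm=10, dn=20, T=32, k=3):
        super().__init__()
        assert dm * dn == d
        self.dm, self.dn = dm, dn
        self.conv = nn.Conv2d(1, T, k, padding=k // 2)     # kernel ω, T channels
        self.W = nn.Linear(T * 2 * dm * dn, d)             # projection matrix W

    def forward(self, xh, y, xt):                          # each: (batch, d)
        stacked = torch.cat([xh.view(-1, 1, self.dm, self.dn),
                             y.view(-1, 1, self.dm, self.dn)], dim=2)  # [X_h; Y]
        feat = torch.relu(self.conv(stacked)).flatten(1)   # vec([X_h; Y] * ω)
        return torch.sigmoid((self.W(feat) * xt).sum(-1))  # σ(vec(...) W · X_t)
```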

    For ease of reference, we summarize the main symbols used in this section in Table 4.

    Table 4.  Notation summary.
    Notation | Explanation
    $X$ | Embedded entity vector
    $\hat{X}$ | Intermediate state of the entity embedding
    $\hat{y}$ | Intermediate state of the path embedding
    $y_i$ | Embedded path vector
    $Y$ | Encoded complete path vector
    $W$ | Parameter matrix


    We evaluate the effectiveness of the MLSFF model on two publicly available datasets: (ⅰ) FB15K-237-IMG, a subset of the large-scale knowledge graph Freebase in which each entity has 10 images; it is a commonly used dataset in KGC tasks; (ⅱ) WN18-IMG, an extension of WN18, a knowledge graph extracted from WordNet, in which each entity likewise has 10 images [52]. Both datasets are available as FB15k-WN18-images. Table 5 shows the statistics of the datasets.

    Table 5.  Statistics of datasets.
    Datasets | #Entities | #Relations | #Train | #Dev | #Test
    FB15k-237-IMG | 14,541 | 237 | 272,115 | 17,535 | 20,466
    WN18-IMG | 40,943 | 18 | 141,442 | 5,000 | 5,000


    Evaluation Metrics: We adopted classic knowledge graph completion evaluation metrics, including Hits@k and mean rank (MR), as shown in Table 6.

    Table 6.  Summarization of evaluation metrics.
    Evaluation metric | Calculation formula
    Hits@k | $\mathrm{Hits@}k = \frac{1}{|Q|}\sum_i \mathbb{1}(\mathrm{rank}_i \le k)$
    MR | $\mathrm{MR} = \frac{1}{|Q|}\sum_i \mathrm{rank}_i$


    Hits@k: The Hits@k metric is defined as the proportion of true entities that appear in the top-k ranked list of entities. It is calculated as follows:

    $\mathrm{Hits@}k = \frac{\sum_i \mathbb{1}(\mathrm{rank}_i \le k)}{|Q|}$ (5.1)

    where $\mathrm{rank}_i$ denotes the rank of the expected entity for the $i$th incomplete fact triple and $|Q|$ is the total number of incomplete fact triples.

    Mean Rank (MR): Mean Rank is the arithmetic average of the individual entity ranks, defined as:

    $\mathrm{MR} = \frac{1}{|Q|}\sum_i \mathrm{rank}_i$ (5.2)
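    A minimal sketch computing both metrics from a list of ranks (ours, for illustration):

```python
def hits_at_k(ranks, k):
    """Proportion of queries whose true entity ranks in the top k (Eq 5.1)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Arithmetic mean of the true-entity ranks (Eq 5.2)."""
    return sum(ranks) / len(ranks)

# Example: ranks of the true entity over |Q| = 5 incomplete triples
ranks = [1, 3, 12, 2, 40]
print(hits_at_k(ranks, 10), mean_rank(ranks))   # 0.6 11.6
```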

    Parameter Configuration To balance model scale and computational efficiency, we choose the ViT-B/16 pre-trained model for the image encoder. We set the embedding dimension for both text and image to 768. The number of layers for both the image and text encoders is set to 12, while the number of layers for the modality encoder is set to 3. The graph embedding dimension is set to 200, and the batch size to 64. We use warmup together with the Adam optimizer to adjust the learning rate of the model parameters. The initial learning rate is set to 0.0005, and the dropout rate to 0.1.
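    A minimal sketch of this training configuration (ours; the linear warmup schedule and its step count are assumptions, since only the use of warmup is stated):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 200)       # stand-in for the full MLSFF network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # initial lr 0.0005

# Linear warmup over the first `warmup_steps` steps, then a constant rate
warmup_steps = 1000               # assumed; the exact schedule is not specified
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
dropout = nn.Dropout(p=0.1)       # dropout rate from the configuration above
```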

    Baseline Setup We selected four unimodal methods and four multi-modal methods as baselines for comparison with our proposed model. The unimodal methods are: 1) TransE [26], a classic translation-based model that encodes entities and relations into a linear space; 2) DistMult [53], which uses a linear neural network to encode a multi-relation graph for multi-relation learning; 3) ComplEx [54], which handles both symmetric and asymmetric relations by introducing complex-valued embeddings; and 4) RotatE [55], which defines relations as rotations from the head entity to the tail entity in complex space to achieve multi-class reasoning. The multi-modal methods are: (ⅰ) IKRL (UNION) [44], which extends TransE to learn visual representations of entities and structural features of KGs; (ⅱ) TransAE [56], which combines multi-modal encoders with TransE to achieve a unified representation of visual and textual features; (ⅲ) RSME [57], which uses a forget gate to select valuable images for MKG embedding; and (ⅳ) MKGformer [52], which proposes an MKG pre-training model based on a hybrid Transformer structure.

    The experimental results on the two datasets are shown in Table 7, which shows that our model generally outperforms the 8 baseline methods.

    Table 7.  Results of link prediction on FB15k-237-IMG and WN18-IMG.
    Model FB15k-237-IMG WN18-IMG
    Hits@1↑ Hits@3↑ Hits@10↑ MR↓ Hits@1↑ Hits@3↑ Hits@10↑ MR↓
    TransE 0.198 0.376 0.441 323 0.40 0.745 0.923 357
    DistMult 0.199 0.301 0.466 512 0.335 0.876 0.940 655
    ComplEx 0.194 0.297 0.450 546 0.936 0.945 0.947 -
    RotatE 0.241 0.375 0.533 177 0.942 0.950 0.957 254
    IKRL (UNION) 0.194 0.284 0.458 298 0.127 0.796 0.928 596
    TransAE 0.199 0.317 0.463 431 0.323 0.835 0.934 352
    RSME 0.242 0.344 0.467 417 0.943 0.951 0.957 223
    MKGformer 0.256 0.367 0.504 221 0.944 0.961 0.972 28
    MLSFF (ours) 0.274 0.411 0.552 193 0.951 0.973 0.980 22

     | Show Table
    DownLoad: CSV

    Firstly, across all methods, the scores on FB15k-237-IMG are generally lower than those on WN18-IMG. The fundamental reason is that FB15k-237-IMG is sparser and more complex than WN18-IMG, with a greater variety of relations between entities. In addition, our model performs particularly strongly on Hits@1, indicating a superior discriminative ability in predicting unknown entities. In the MLSFF model, we use two single-modal encoders to extract image and text information, followed by a multi-modal layer for interaction, which enables full learning of the semantic information in entity descriptions. We introduce a sliding window when learning link features, which realizes "scalable" path sampling and, to some extent, alleviates the problem of complex graph structures.

    Secondly, some traditional single-modal methods, such as RotatE, even outperform architectures that use multi-modal features in overall performance. This suggests that well-designed relation decomposition and learning rules are effective for solving complex graph problems, and that fully exploiting structural features can improve prediction accuracy. Therefore, after obtaining the multi-modal encoding, our model not only uses traditional graph convolution to gather neighbor-node information, but also incorporates long-distance path features, borrowing from the recurrent neural network structures used in text processing to extract information from nodes on either side of a selected path. By adding such "vertical" features during the convolution process, our prediction model gains better interpretability.

    Finally, our model achieved improvements of 4.8% and 1.2% on the two datasets, respectively. However, on FB15k-237-IMG, our model's MR was slightly inferior to that of RotatE. This could be attributed to FB15k-237-IMG containing a larger number of entities and a more diverse set of relations, resulting in a sparser and more complex knowledge graph. While our model improves the learning of multi-hop path relations to some extent, it lacks the treatment of negative samples seen in RotatE, which affects the overall accuracy. Overall, the experimental results demonstrate that our model outperforms existing methods on most evaluation metrics, with even larger improvements on more complex knowledge graphs. This is because the MLSFF model learns more comprehensive semantic features by fusing information from both the image and text modalities, enabling more comprehensive knowledge extraction from the graph. In addition, we employ convolutional operations that capture neighborhood information and an LSTM structure that learns path-level features, yielding a more comprehensive, multi-level feature encoding structure for graph structural features, which is highly effective for processing large-scale knowledge graphs.

    To investigate the actual effects of each component in the MLSFF model, we conducted ablation studies by removing some of the components.

    w/o SinE: To investigate the effect of the single-modal encoders on understanding image and text semantics, we aligned the one-dimensional vectorized image patches and text embeddings, calculated their Hadamard product, and directly fed them into the multi-modal encoder for learning.

    w/o Flt: To further investigate the actual effect of the irrelevance filtering layer, we also tested the multi-modal fusion module by directly fusing the encoded image and text features without the filtering layer.

    w/o Swin: To demonstrate the positive effect of extracting path information on learning graph structure features, we removed the sliding window encoding module and only used graph convolution operations to obtain structural embeddings.

    From Figure 4, it can be seen that using single-modality encoders to extract image and text features can effectively enhance semantic understanding and better learn human knowledge, thereby promoting and improving the performance in KGC tasks. Although image features can assist in text understanding, there is still some noise interference. Filtering out irrelevant information can further enhance the fusion effect between multi-modal features and improve accuracy. In addition, when facing large-scale and complex knowledge graphs, although graph convolutional operations can already fully learn structural information and capture neighbor features, the introduction of path and rule features can further improve model interpretability and prediction ability. Specifically, when dealing with sparse graphs, simple convolutional operations may lead to a certain decrease in accuracy, and learning path features can also help improve model efficiency.

    Figure 4.  Ablation on different components of the MLSFF.

    Our link prediction module is mainly implemented with a GNN algorithm, which aggregates neighbor information into the target node and then updates the target node based on the integrated information. However, this approach is prone to over-smoothing: the representations of different nodes tend to become similar as the number of GNN layers increases during training. To address this issue, we introduce "longer-distance" path embeddings, which combine depth and breadth features to extract complex graph structure information.

    We further explore effective graph processing structures by adjusting the number of convolutional layers and the size of the sliding window. In this work, considering memory and computational capacity, we conduct experiments with sliding window widths ranging from 1 to 3. As shown in Figure 5, the model performs better when the sliding window width is set to 2.

    Figure 5.  Impacts of the width on FB15k-237 and WN18RR.

    When the sliding window width is set to 2, our model can learn more layers of graph structural features and neighbor information. When the sliding window width is too small, that is, when the number of subgraphs learned is too few, the information in the knowledge graph cannot be fully aggregated to learn the structural information of the knowledge graph. In addition, some useful high-order neighbors cannot be captured. When the number of subgraphs is too large, the node representation is overly smoothed due to excessive noise.

    MLSFF: Denote the entity embedding dimension by $d_e$, the structural embedding dimension by $d_r$, and the number of channels by $T$, and let the final output dimension for triplet encoding be $m \times n$. The main complexity of our model is $O(|E|d_e + |R|d_r + Tmn + Td(2d_m - m + 1)(d_n - n + 1))$.

    TransE: The scoring function of the TransE model is $\|h + r - t\|$; consequently, its algorithmic complexity is $O(|E|d + |R|d)$.

    RED-GNN: As a GNN model for the traditional knowledge graph completion task, RED-GNN has an algorithmic complexity of $O(d \min(\bar{D}^L, |F|L))$, where $\bar{D}$ is the average degree of the $r$-directed graph per layer. It can be observed that our model has a slightly higher computational complexity. This is attributable to two main factors: first, the inherent complexity of multimodal knowledge graphs; and second, the decision to incorporate a more extensive graph feature learning scheme to enhance the interpretability of paths.

    Despite the promising results and contributions of our study, there are some limitations that should be acknowledged:

    While our model aims to enhance interpretability by incorporating graph features and multi-hop paths, the interpretability of the model's predictions may still be limited. Explaining the reasoning behind specific predictions or understanding the underlying decision-making processes can be challenging, especially in complex multimodal knowledge graphs.

    In addition, the proposed model in this paper exhibits high complexity, which results in increased demands for computational resources and significant time consumption. Furthermore, our model does not consider the possibility of negative samples during the sampling process, which has an impact on the overall accuracy of the prediction task.

    We propose the MLSFF model, which first uses two independent single-modality encoders to obtain pre-trained embeddings for image and text information. Then, after filtering out irrelevant information, the multi-modal features are fused into a unified encoding vector. We utilize graph convolution to learn the structural information of the knowledge graph and introduce path-based features into the graph structural features to obtain richer relation representations. Our experimental results demonstrate that our model achieves better performance on MKGC tasks. To address the high complexity and the omission of negative samples in our model, future research will focus on: (ⅰ) designing simpler, more streamlined, and more computationally efficient scoring functions; (ⅱ) accounting for negative-sample interference, thereby mitigating its impact on the accuracy of the prediction task; and (ⅲ) incorporating additional modalities, such as numerical features, to achieve a more comprehensive and diverse multimodal fusion and enhance the overall performance of the model.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by the National Natural Science Foundation of China-China State Railway Group Co., Ltd. Railway Basic Research Joint Fund (Grant No.U2268217) and the Scientific Funding for China Academy of Railway Sciences Corporation Limited (No.2021YJ183).

    The authors declare there is no conflict of interest.


