Research article

Linguistic summarisation of multiple entities in RDF graphs

  • Methods for producing summaries from structured data have gained interest due to the huge volume of available data in the Web. Simultaneously, there have been advances in natural language generation from Resource Description Framework (RDF) data. However, no efforts have been made to generate natural language summaries for groups of multiple RDF entities. This paper describes the first algorithm for summarising the information of a set of RDF entities in the form of human-readable text. The paper also proposes an experimental design for the evaluation of the summaries in a human task context. Experiments were carried out comparing machine-made summaries and summaries written by humans, with and without the help of machine-made summaries. We develop criteria for evaluating the content and text quality of summaries of both types, as well as a function measuring the agreement between machine-made and human-written summaries. The experiments indicated that machine-made natural language summaries can substantially help humans in writing their own textual descriptions of entity sets within a limited time.

    Citation: Elizaveta Zimina, Kalervo Järvelin, Jaakko Peltonen, Aarne Ranta, Kostas Stefanidis, Jyrki Nummenmaa. Linguistic summarisation of multiple entities in RDF graphs[J]. Applied Computing and Intelligence, 2024, 4(1): 1-18. doi: 10.3934/aci.2024001




    Cracks are among the most common pavement defects. They reduce the efficiency of road traffic and can even lead to serious accidents that endanger lives. Timely detection, accurate evaluation and repair of cracks are therefore key tasks in pavement maintenance. Figure 1 shows some examples of pavement cracks. Given the huge stock of domestic roads, traditional manual crack detection can no longer meet current demand, so intelligent crack detection based on pavement images has gradually attracted wide attention. However, pavement cracks have complex topological structures, uneven lighting and noisy textured backgrounds [1], which makes effective crack detection a significant challenge.

    Figure 1.  Examples of pavement cracks.

    In recent years, deep learning has made significant breakthroughs in various computer vision tasks, and new deep neural network models are constantly emerging [2]. Deep convolutional neural networks (CNNs) [3,4] and Transformer neural networks [5] are two representative model families for semantic segmentation. CNN-based models have dominated this field since the advent of the fully convolutional network (FCN), but the recent segmentation Transformer (SETR) [6] replaced the CNN encoder with a pure Transformer encoder, changing the architecture of semantic segmentation models. However, the output feature map of the Transformer has low resolution and lacks detailed information about cracks, and the decoder proposed by SETR does not effectively solve this problem, which results in poor performance on slender cracks.

    In this paper, we propose a Transformer-based multi-scale feature fusion network for pavement crack detection. It uses the vision Transformer (ViT) model to extract crack features, dilated convolution to expand the receptive field, and upsampling with multi-scale feature fusion to restore resolution. Our method models the feature map globally and recovers the detailed information of cracks by fusing multi-scale features, so it segments slender cracks accurately and generalizes better.

    The contributions of this work are as follows:

    (1) We propose an automatic crack segmentation method based on Transformer. Compared with existing pavement crack detection networks, our method has higher detection accuracy and better generalization.

    (2) To address the challenge that the ViT model can only output feature maps at a fixed resolution, we introduce a multi-scale feature fusion module as a solution.

    (3) We study the influence of different brightness levels on the performance of crack segmentation models.

    (4) We demonstrate the effectiveness of various blocks and their combinations, such as dilated convolution blocks and feature fusion blocks.

    Early research usually relied on digital image processing to detect pavement cracks automatically. The gray values of a crack and of the background often differ markedly, so the crack can be separated from the background by selecting an appropriate threshold [7,8,9]. However, thresholding is sensitive to noise and usually requires additional preprocessing and postprocessing. Edge detection segments an image based on abrupt changes and discontinuities in gray level: at the boundary of an object, the gray value of the pixels often changes significantly [10]. Edge detection operators, such as Sobel [11] and Canny [12,13], can separate the crack from the background by calculating the gradient change at the crack boundary. However, edge detection is an ill-posed problem, and no edge detector responds only to the features of the target object in the image [14].

    Although crack detection methods based on digital image processing can achieve good results, they require high-quality input images, and their performance degrades severely when the contrast between crack and background is low, the lighting is uneven, or noise is present.

    In recent years, deep learning, especially deep convolutional neural networks, has been widely used in image classification, object detection, semantic segmentation and other fields, because deep neural networks trained with backpropagation extract image features automatically.

    Ronneberger et al. [4] proposed U-Net based on the FCN [3] structure. The network is composed of an encoder and a decoder, has a simple structure and trains quickly, and has been widely used in biomedical image segmentation. Liu et al. [15] first applied U-Net to crack segmentation and achieved good results on small datasets. After this success, some scholars improved U-Net by adding attention mechanisms to enhance the identification of cracks and obtain more accurate semantic information [16,17,18]. Chen et al. [19] proposed an encoder-decoder network based on the SegNet [20] model and initialized with pretrained weights, which has high crack detection performance and generalization ability. Liu et al. [21] proposed a deep hierarchical convolutional neural network named DeepCrack, which consists of a fully convolutional network and a deeply supervised net (DSN); its final feature map aggregates the multi-scale and multi-level features of different convolutional layers. The features of the different convolution stages are directly supervised by the DSN, which realizes end-to-end pixel-level crack segmentation. Ren et al. [22] proposed CrackSegNet, a deep fully convolutional neural network in which dilated convolution and a pyramid pooling module [23] increase the receptive field to obtain context information, thus realizing effective segmentation of concrete cracks.

    However, because the convolutional kernel has a limited size, it is difficult for convolutional neural networks to obtain a large receptive field. Adding more layers can solve this problem, but it makes the model more complicated and increases the computational cost.

    The Transformer model first emerged in natural language processing and was first applied to image classification in [24], where the ViT model was proposed. In 2020, Zheng et al. [6] proposed the SETR segmentation algorithm based on the Transformer model, which uses the ViT to extract image features and gave subsequent researchers a sequence-based perspective on image semantic segmentation. The CrackFormer [25] model adopts the self-attention mechanism to encode and decode the feature map, and combines the outputs of the corresponding encoder and decoder with a scaling-attention block to obtain clear crack boundaries. The SegFormer [26] model abandons position embedding and designs a hierarchical Transformer encoder, which uses overlapping patch merging to downsample the feature map and obtain multi-scale features. Its decoder uses only linear layers, which reduces the complexity of the network while achieving good segmentation results. The SegCrack [27] model adopts the same encoder as SegFormer. When decoding, it uses lateral connections to restore the feature map scale layer by layer, and then fuses the feature maps of all scales into a multi-scale feature map, which gives a more powerful representation by combining local features with global features. Feng et al. [28] used the Swin Transformer [29] to encode the crack image and fed the features of different encoder stages into a multi-layer perceptron (MLP) layer to unify their channels and sizes; fusing the features of the different stages yields efficient and accurate segmentation of pavement cracks. The TransMF [30] model uses a symmetric encoder and decoder structure with an added fusion module. The encoder uses a hybrid model of convolution and Swin Transformer to model the crack from local and global perspectives, and the fusion module fuses the encoding and decoding features, which reduces the influence of noise and strengthens the correlation between contexts.

    Compared with the convolution operation, the Transformer uses the self-attention mechanism to model the importance of elements in the feature sequence dynamically, and adopts a sequence-to-sequence learning method with less inductive bias [31]. It can achieve better crack detection results and has gradually become one of the mainstream methods for crack detection.

    As shown in Figure 2, the crack detection process consists of two main parts: training the network model and detecting cracks. Once the model is trained, the original crack image is fed into it and the predicted crack map is output.

    Figure 2.  Flow chart of model training (left) and crack detection (right).

    In this paper, we propose a multi-scale feature fusion network for pavement crack detection based on Transformer. The network architecture, shown in Figure 3, is composed of an encoder network and a decoder network. The encoder network adopts the ViT model as the backbone, which is composed of Patch Embedding, Position Embedding and the Transformer Encoder. A dilated convolution block with different combinations of dilation rates is added after the backbone to capture multi-scale information. The decoder consists of upsampling blocks, convolution blocks and the multi-scale feature fusion block proposed in this paper. The decoder network recovers the resolution of the feature map output by the encoder network layer by layer and fuses the multi-scale features to obtain the final segmentation result.

    Figure 3.  Network structure of proposed method.

    The Transformer encoder of ViT adopts the Pre-LN Transformer architecture [32]. This structure puts the layer normalization block of the Transformer model inside the residual structure, which can improve the convergence speed. In the encoding stage, the input image is first split into 16 × 16 patches, and each patch is flattened into a one-dimensional vector called a token. Then, the tokens are layer normalized, activated by the GeLU function, and combined with position embeddings. Finally, the tokens are input into the Transformer encoder to extract crack image features. The encoder is a stack of 12 identical encoder blocks, each composed of layer norm, multi-head self-attention, dropout and an MLP block. Inside the encoder block, the Layer Norm layer first normalizes the input data, and then the self-attention and fully connected operations are performed. The self-attention formula [5] is as follows:

    $$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

    where Q, K and V are three matrices obtained from the input features through three linear transformations. Q is the feature matrix of the crack area that needs to be attended to, K is the feature matrix of all locations in the entire image, and V is the feature matrix of the locations corresponding to K. The similarity between Q and K gives the attention scores, which are used to compute a weighted average of V. When the similarity between a row of Q and a row of K is high, the corresponding row of V receives a higher weight, and vice versa. This allows crack features to stand out while suppressing unnecessary information. d_k is the dimension of the key vectors, i.e., the number of columns of K.
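
    The weighted-average reading of the self-attention formula can be illustrated with a minimal pure-Python sketch. It is a single-head toy version on nested lists, not the multi-head implementation used in ViT, and the names `softmax` and `attention` as well as the example values are chosen here purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])  # dimension of each key vector
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Toy example (hypothetical values): the query matches the first key best,
# so the first value row receives the larger weight.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = attention(Q, K, V)
print(result)
```

    Because the value rows in this toy example are one-hot, the printed row is simply the attention weights, which sum to one.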

    To improve the detection of slender cracks, the receptive field must be enlarged to capture long-distance dependence. Dilated convolution increases the receptive field of the feature map without reducing its resolution, thereby capturing context information [33].

    As shown in Figure 4, the dilated convolution block combines convolutions with different dilation rates, and the receptive fields of its layers are 3, 7 and 15, respectively. The feature sequence output by the backbone network is reshaped into a feature map of size 16 × 16 × 768. In the dilated convolution block, the receptive field can therefore cover most of the 16 × 16 feature map. Fusing the feature maps of different receptive fields helps to obtain context information and further extract the crack features.
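
    The receptive fields of 3, 7 and 15 can be reproduced with simple arithmetic. The text does not state the kernel size or the dilation rates, so the sketch below assumes 3 × 3 kernels with stride 1 and shows two hypothetical configurations that both give these values: stacked layers with dilation rates 1, 2 and 4, and parallel single layers with dilation rates 1, 3 and 7. Which of the two (if either) matches Figure 4 is not known from the text.

```python
def stacked_receptive_fields(kernel_size, dilation_rates):
    """Receptive field after each layer of a stride-1 stack of dilated convolutions."""
    field = 1
    fields = []
    for rate in dilation_rates:
        # Each layer widens the field by (kernel_size - 1) * dilation rate.
        field += (kernel_size - 1) * rate
        fields.append(field)
    return fields

def single_layer_receptive_field(kernel_size, rate):
    """Receptive field of one dilated convolution applied directly to the input."""
    return 1 + (kernel_size - 1) * rate

# Assumed configurations (not given in the text).
stacked = stacked_receptive_fields(3, [1, 2, 4])
parallel = [single_layer_receptive_field(3, r) for r in (1, 3, 7)]
print(stacked)   # [3, 7, 15]
print(parallel)  # [3, 7, 15]
```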

    Figure 4.  Dilated convolution block.

    Low-level features contain more detail information, which helps to restore the crack boundary and improve segmentation accuracy. We design a simple and efficient feature fusion block. A linear layer is added after the output of the 3rd, 6th and 9th encoder blocks of the Transformer encoder. Setting different numbers of neurons in these layers changes the size of the feature sequence to 256 × 8192, 256 × 4096 and 256 × 2048, respectively. After reshaping the feature sequences, three feature maps with different scales are obtained, with sizes 128 × 128 × 128, 64 × 64 × 256 and 32 × 32 × 512, respectively.

    In the upsampling block, the resolution of the feature map is doubled and the number of channels is halved, and the result is fused with the feature map of the same scale. The fused feature map is then input to the convolution block, which applies two convolutions, each followed by batch normalization and a ReLU activation function. These operations are repeated to recover the resolution of the feature map layer by layer and fuse the multi-scale features, so as to obtain the final crack segmentation map.
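
    The sequence and feature-map sizes quoted above are consistent in their element counts, which a short check makes explicit. This is only a bookkeeping sketch: it assumes 256 tokens per image (a 256 × 256 input with 16-pixel patches) and does not model how the elements are actually rearranged, which the text does not specify. The name `fused_map_shape` is hypothetical.

```python
NUM_TOKENS = 256  # assumed: 16 x 16 patches from a 256 x 256 input

def fused_map_shape(linear_out_features, side):
    """Return (height, width, channels) of a map holding the same number of
    elements as a NUM_TOKENS x linear_out_features feature sequence."""
    total = NUM_TOKENS * linear_out_features
    channels, remainder = divmod(total, side * side)
    if remainder:
        raise ValueError("sequence size does not fill a square feature map")
    return (side, side, channels)

shapes = [
    fused_map_shape(8192, 128),  # after the 3rd encoder block
    fused_map_shape(4096, 64),   # after the 6th encoder block
    fused_map_shape(2048, 32),   # after the 9th encoder block
]
print(shapes)  # [(128, 128, 128), (64, 64, 256), (32, 32, 512)]
```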

    To validate the proposed methodology and conduct comparative analysis with other approaches, we selected two standard crack datasets, namely Crack500 [34] and DeepCrack [21], for our experimentation.

    (1) The Crack500 dataset contains a variety of complex pavement backgrounds and various types of asphalt pavement cracks, including 3368 pavement crack images with 640 × 360 pixels, each of which has a corresponding binary image labeled with pixel-level cracks. Among them, 2244 images are used for training and 1124 images are used for testing.

    (2) The DeepCrack dataset contains multi-scale and multi-scene concrete pavement cracks, including 537 concrete pavement crack images with 544 × 384 pixels and their corresponding crack labels. 300 images are used for training and 237 images are used for testing.

    Before training the model, the dataset was augmented. Each image in the Crack500 dataset was cropped at the periphery and center into 256 × 256 patches. After cropping, we counted the crack pixels in each patch and deleted patches with fewer than 1000 crack pixels; each remaining patch was then rotated to 4 different angles in steps of 90°. Because the DeepCrack dataset is small, a horizontal flip was added to the above operations. For the test sets, only the crop and delete operations were performed. After augmentation, there are 21704 training images and 3278 test images for the Crack500 dataset, and 7888 training images and 900 test images for the DeepCrack dataset.

    The experiments were run on an Nvidia GeForce RTX3090 GPU and implemented in the PyTorch framework. Input images for training and testing are uniformly sized 256 × 256 × 3.

    Pavement crack segmentation is a binary classification task, so the binary cross entropy (BCE) loss function is used to measure the model's ability to classify each pixel correctly. The BCE loss is defined as follows:
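
    The crop, filter and rotate steps can be sketched as follows. The exact crop positions are not given in the text, so the sketch assumes five crops per image (the four corners plus the center); the helper names and the list-of-lists image representation are likewise illustrative assumptions rather than the authors' code.

```python
CROP = 256               # crop side length from the text
MIN_CRACK_PIXELS = 1000  # patches with fewer crack pixels are discarded

def crop_boxes(width, height, size=CROP):
    """Assumed layout: four corner crops plus one center crop,
    each returned as (left, top, right, bottom)."""
    right, bottom = width - size, height - size
    origins = [(0, 0), (right, 0), (0, bottom), (right, bottom),
               (right // 2, bottom // 2)]
    return [(x, y, x + size, y + size) for x, y in origins]

def keep_patch(mask):
    """Keep a patch only if its binary mask has at least MIN_CRACK_PIXELS ones."""
    return sum(sum(row) for row in mask) >= MIN_CRACK_PIXELS

def rotate90(patch):
    """Rotate a 2-D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]

def four_rotations(patch):
    """Return the patch at 0, 90, 180 and 270 degrees."""
    rotations = [patch]
    for _ in range(3):
        rotations.append(rotate90(rotations[-1]))
    return rotations

boxes = crop_boxes(640, 360)  # Crack500 image size
print(boxes)
```

    With these assumptions a 640 × 360 image yields five overlapping 256 × 256 crops, and each crop that passes the crack-pixel filter contributes four rotated copies to the training set.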

    $$L_{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\right] \tag{2}$$

    where N represents the total number of pixels contained in the image, and $y_i$ and $\hat{y}_i$ represent the ground truth and the prediction at pixel i, respectively.

    Because crack pixels make up only a small proportion of the image, the BCE loss function biases the network towards learning background features. The Dice loss function [35] focuses more on preserving the detailed information of cracks in the image and performs well especially at crack boundaries. The Dice loss is defined as follows:

    $$L_{Dice}=1-\frac{2\sum_{i=1}^{N}y_i\hat{y}_i}{\sum_{i=1}^{N}y_i+\sum_{i=1}^{N}\hat{y}_i} \tag{3}$$

    We combine the Dice loss function and the BCE loss function. The combined loss function pays more attention to the crack region and can effectively alleviate the problem caused by the imbalance of positive and negative samples [36]. The combined loss function [37] is defined as:

    $$L_{Total}=L_{Dice}+\alpha L_{BCE} \tag{4}$$

    where α is the weighting factor to balance the importance between the BCE loss function and the Dice loss function.
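
    The three loss definitions above can be written out as a minimal pure-Python sketch over flattened pixel lists. It is not the training code: the small constant `eps`, which guards against log(0) and division by zero, is an assumption added here, and the default `alpha=0.2` is the value reported in the experimental settings.

```python
import math

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross entropy averaged over all pixels (Eq. 2)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def dice_loss(y_true, y_pred, eps=1e-7):
    """Dice loss (Eq. 3); eps is an assumed guard for empty masks."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    denominator = sum(y_true) + sum(y_pred)
    return 1.0 - 2.0 * intersection / (denominator + eps)

def total_loss(y_true, y_pred, alpha=0.2):
    """Combined loss (Eq. 4): Dice plus alpha-weighted BCE."""
    return dice_loss(y_true, y_pred) + alpha * bce_loss(y_true, y_pred)

# Toy example with hypothetical values: four pixels, two of them crack.
truth = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.1, 0.9, 0.2, 0.8]
print(total_loss(truth, good), total_loss(truth, bad))
```

    In the toy example the confident, correct prediction gives a smaller combined loss than the inverted one, as expected.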

    In this paper, precision (Pr), recall (Re), and F1 score are used to evaluate the results of pavement crack detection, which are defined as:

    $$Pr=\frac{TP}{TP+FP},\qquad Re=\frac{TP}{TP+FN},\qquad F1=\frac{2\,Pr\times Re}{Pr+Re} \tag{5}$$

    where TP is the number of pixels correctly classified as crack, FP is the number of background pixels misclassified as crack, and FN is the number of crack pixels misclassified as background.
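
    As an illustration of these definitions, the sketch below counts TP, FP and FN on flattened binary masks and derives Pr, Re and F1. The function name `crack_metrics` and the convention of returning 0 when a denominator is zero are assumptions made for this example.

```python
def crack_metrics(pred, truth):
    """Pixel-level precision, recall and F1 for binary masks (1 = crack)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    # Assumed convention: a metric with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy masks (hypothetical): 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
pr, re, f1 = crack_metrics(pred, truth)
print(pr, re, f1)
```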

    To verify the effectiveness of the proposed method, three sets of experiments were designed: 1) Model comparison: each model was trained and tested on the DeepCrack and Crack500 datasets separately, and the F1 score and other indicators of the models were compared. 2) Generalization comparison: the models were trained on the Crack500 dataset, and three copies of the DeepCrack test set were made, one with brightness increased to 1.5 times, one with brightness reduced to 0.5 times, and one unchanged. The models were tested on the three test sets separately to compare their generalization. 3) Ablation: the DeepCrack dataset was used to train and test models with the feature fusion block and the dilated convolution block removed, and the influence of each block on the model was evaluated.

    During the experiments, the batch size is set to 16, the number of epochs to 50, and the initial learning rate to 1e-5, which is multiplied by a decay rate of 0.5 after every 5 epochs. The optimizer is Adam, and α is set to 0.2.

    To illustrate the effectiveness of the proposed method, other state-of-the-art methods were selected for comparison, including CNN-based methods, such as U-Net [15] and CrackSegNet [22], and Transformer-based methods, such as SETR [6], SegFormer [26] and SegCrack [27]. SETR is evaluated with two different decoder designs, SETR-MLA and SETR-PUP, both with a ViT-B/16 backbone. The backbone of both SegCrack and SegFormer is MiT-B2.

    Tables 1 and 2 present the quantitative results of the seven models on the Crack500 dataset and the DeepCrack dataset. Our method has the highest Pr and F1 score on both datasets. Its Re is lower than that of U-Net on Crack500 and that of CrackSegNet on DeepCrack, but higher than that of all other methods. Because U-Net and CrackSegNet sacrifice Pr for higher Re, they label a wider range of objects as crack in actual detection, which leads to serious false positives. Our method balances Pr and Re and achieves the highest F1 scores of 70.84% and 84.50% on the two datasets, which are 1.42% and 2.07% higher than the second-best method, respectively.
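
    The step decay of the learning rate described above amounts to a one-line formula. The sketch below assumes zero-indexed epochs, so that epochs 0 to 4 use the initial rate and the first halving takes effect at epoch 5; the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-5, decay=0.5, step=5):
    """Learning rate at a zero-indexed epoch under step decay."""
    return initial_lr * decay ** (epoch // step)

# Rates at the start of each 5-epoch stage of the 50-epoch schedule.
schedule = [learning_rate(e) for e in range(0, 50, 5)]
print(schedule)
```

    Under this assumption the rate is halved nine times over the 50 epochs, so the last stage trains at 1e-5 × 0.5^9.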

    Table 1.  Comparison results of various methods on Crack500 dataset.
    Methods Pr Re F1
    U-Net 64.49% 81.25% 69.08%
    CrackSegNet 65.64% 78.30% 68.43%
    SegCrack 67.12% 77.64% 69.42%
    SegFormer 64.53% 77.15% 67.44%
    SETR-PUP 65.69% 78.48% 68.90%
    SETR-MLA 65.69% 72.53% 66.29%
    Ours 68.06% 79.11% 70.84%

    Table 2.  Comparison results of various methods on DeepCrack dataset.
    Methods Pr Re F1
    U-Net 83.92% 85.92% 82.43%
    CrackSegNet 74.59% 91.63% 79.94%
    SegCrack 83.34% 82.21% 80.28%
    SegFormer 82.61% 82.54% 80.29%
    SETR-PUP 82.50% 80.21% 79.35%
    SETR-MLA 69.48% 75.31% 69.99%
    Ours 85.98% 86.20% 84.50%


    The segmentation results of the various methods on the Crack500 and DeepCrack datasets are shown in Figures 5 and 6. Transformer-based models such as SegFormer, SETR-MLA and SETR-PUP are less affected by noise because of their larger receptive field, but they do not achieve satisfactory results on slender cracks. Thanks to its structure, U-Net segments slender cracks better by fusing feature maps of different layers; however, because of the small receptive field of the low-level feature maps, its segmentation results in Figures 5 and 6 still contain much noise. Our method uses the Transformer to extract crack features and fuses multi-scale features, which effectively reduces the negative impact of noise and realizes accurate segmentation of slender cracks against complex backgrounds. The precision-recall (P-R) curves in Figure 7 show that our method outperforms the other compared methods.

    Figure 5.  Detection results of different methods on the Crack500 dataset.
    Figure 6.  Detection results of different methods on the DeepCrack dataset.
    Figure 7.  P-R curves on Crack500 dataset (left) and DeepCrack dataset (right).

    To test the generalization ability of each model, the models trained on the Crack500 training set are tested on the DeepCrack test set. The experimental results are shown in Table 3.

    Table 3.  Generalization experimental results on DeepCrack testset.
    Methods Pr Re F1
    U-Net 56.11% 86.40% 61.75%
    CrackSegNet 63.81% 81.44% 66.91%
    SegCrack 65.24% 85.31% 70.32%
    SegFormer 66.59% 85.01% 70.06%
    SETR-PUP 65.13% 86.84% 71.93%
    SETR-MLA 57.23% 81.77% 64.42%
    Ours 68.22% 90.25% 75.19%


    According to the results of the generalization experiment on the DeepCrack dataset, the Pr, Re and F1 score of our method are 1.63%, 3.41% and 3.26% higher than those of the second-best method, respectively, demonstrating that the proposed model generalizes well.

    In addition, crack detection in practice is often affected by lighting. To simulate different lighting conditions, the DeepCrack test set is processed with brightness enhancement and brightness reduction; the processed test sets are shown in Figure 8. The two test sets are tested separately, and the experimental results are shown in Tables 4 and 5.
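
    The brightness factors of 1.5 and 0.5 from the experimental design can be applied in several ways, and the text does not say which was used. The sketch below shows one common choice as an assumption: scaling 8-bit pixel values linearly and clipping them to the range 0 to 255. The function name and the sample pixel values are hypothetical.

```python
def adjust_brightness(pixels, factor):
    """Scale 8-bit pixel values by a factor and clip to [0, 255] (assumed method)."""
    return [[min(255, max(0, round(value * factor))) for value in row]
            for row in pixels]

sample = [[100, 200], [0, 40]]             # hypothetical grayscale patch
brighter = adjust_brightness(sample, 1.5)  # brightness enhancement
darker = adjust_brightness(sample, 0.5)    # brightness reduction
print(brighter, darker)
```

    Clipping means that bright regions saturate at 255 under enhancement, which is one way contrast between crack and background can be lost.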

    Figure 8.  Crack images at different brightness.
    Table 4.  Results on DeepCrack testset after brightness enhancement.
    Methods Pr Re F1
    U-Net 60.55% 72.69% 59.77%
    CrackSegNet 52.84% 75.54% 56.71%
    SegCrack 72.89% 74.98% 67.52%
    SegFormer 78.26% 70.58% 67.98%
    SETR-PUP 67.50% 83.37% 71.59%
    SETR-MLA 53.89% 83.39% 62.21%
    Ours 72.11% 86.54% 75.78%

    Table 5.  Results on DeepCrack testset after brightness decrease.
    Methods Pr Re F1
    U-Net 38.82% 89.24% 48.05%
    CrackSegNet 49.22% 44.01% 39.46%
    SegCrack 76.08% 70.55% 65.70%
    SegFormer 61.98% 75.20% 59.05%
    SETR-PUP 49.81% 87.81% 59.94%
    SETR-MLA 69.55% 53.13% 53.86%
    Ours 70.29% 82.48% 71.44%


    The experimental results show that CrackSegNet performs poorly when the brightness is high, while the other models are less affected. When the image brightness is low, all models are affected to some extent. U-Net has the highest Re but the lowest Pr, so its F1 score is low. CrackSegNet has difficulty distinguishing cracks from the background when the image brightness is low, and its Re and F1 score are the lowest of all models. Our method is the least affected: its F1 score still exceeds 70%, so satisfactory crack segmentation results can be obtained under this condition. As shown in Figure 9, our method accurately separates cracks from the background under both brightened and darkened conditions, which further indicates that it generalizes better and achieves better detection results under different lighting conditions.

    Figure 9.  Crack segmentation results at different brightness.

    To ascertain the effectiveness of the feature fusion and dilated convolution blocks, we performed ablation experiments on the DeepCrack dataset. The dilated convolution block is abbreviated as "DC" and the feature fusion block as "FF". The experimental results are detailed in Table 6.

    Table 6.  The results of ablation experiment on the DeepCrack dataset.
    Methods Pr Re F1
    Ours 74.64% 91.21% 80.40%
    Ours(DC) 84.53% 84.23% 82.92%
    Ours(FF) 83.77% 86.94% 83.49%
    Ours(FF+DC) 85.98% 86.20% 84.50%


    Table 6 presents the F1 score of the model, which amounts to 80.40% when the feature fusion block and dilated convolution block are not integrated. However, the inclusion of these blocks individually results in an increase of 3.09% and 2.52% in the F1 score, respectively. Remarkably, when both blocks are integrated, the F1 score rises by 4.10%. These findings serve as compelling evidence to support the effectiveness of dilated convolution and multi-scale feature fusion for crack detection.

    The above three experiments demonstrate that the proposed method has high accuracy and good generalization, and can identify road cracks under different lighting conditions.

    In this paper, a multi-scale feature fusion method for pavement crack detection based on Transformer is proposed. In the encoding stage, the ViT model is adopted as the backbone, modeling the crack images from a sequence-to-sequence perspective. Compared with convolutional neural networks, it has a larger receptive field and can capture global information, which helps it extract crack features; a dilated convolution block is added to increase the receptive field of the feature map and obtain further context information. In the decoding stage, we propose a multi-scale feature fusion module. Linear layers adjust the length of the feature sequences output by different encoder blocks of the Transformer model, which are then converted into feature maps of different scales. Fusing multi-scale semantic features recovers detailed information and improves the accuracy of crack detection. We compare our method with other methods on the Crack500 and DeepCrack datasets, where its F1 scores are 70.84% and 84.50%, respectively, higher than those of the other methods. In the generalization experiments, the proposed method generalizes better and can accurately identify cracks under different lighting conditions. The effectiveness of the dilated convolution block and the feature fusion block is verified by ablation experiments.

    The proposed method can reduce labor costs and improve the efficiency and accuracy of crack detection, and it can be extended to the biomedical field, for example to the identification of retinal diseases from optical coherence tomography [38,39]. Although the proposed method performs well, it has some limitations. On the one hand, the Transformer encoder needs a large amount of training data, and manual annotation is time-consuming and subjective, which inevitably introduces errors. On the other hand, the Transformer has many parameters and a high computational cost: training took 10 hours on the Crack500 dataset and 3.5 hours on the DeepCrack dataset, longer than for the other models. Future work will aim to improve the accuracy of crack detection, reduce the model's complexity and improve its detection efficiency, and will explore applying the method to other detection fields.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work is supported by the National Natural Science Foundation of China (62001004), the Key Research and Development Program of Anhui Province (202104A07020017), the Academic Funding Program for Top-Notch Talents of Disciplines (Majors) in Universities of Anhui Province (gxbjZD2022028), the Research Project Reserve of Anhui Jianzhu University (2020XMK04), and the New Era Education Quality Project of Anhui Province (2022cxcysj143 and 2022cxcysj145). All authors have read and agreed to the published version of the manuscript.

    The authors declare there is no conflict of interest.



    [1] V. Christophides, V. Efthymiou, K. Stefanidis, Entity resolution in the Web of data, Synthesis lectures on the Semantic Web: theory and technology, Morgan & Claypool Publishers, 2015. https://doi.org/10.1007/978-3-031-79468-1
    [2] H. Shah, P. Fränti, Combining statistical, structural, and linguistic features for keyword extraction from web pages, Applied computing and intelligence, 2 (2022), 115–132. https://doi.org/10.3934/aci.2022007 doi: 10.3934/aci.2022007
    [3] G. Cheng, T. Tran, Y. Qu, RELIN: relatedness and informativeness-based centrality for entity summarization, The Semantic Web–ISWC 2011, The Semantic Web–ISWC 2011: 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I 10, (2011), 114–129. https://doi.org/10.1007/978-3-642-25073-6_8
    [4] A. Thalhammer, A. Rettinger, Browsing DBPedia entities with summaries, The Semantic Web: ESWC 2014 Satellite Events, (2014), 511–515. https://doi.org/10.1007/978-3-319-11955-7_76
    [5] A. Thalhammer, N. Lasierra, A. Rettinger, LinkSUM: using link analysis to summarize entity data, International Conference on Web Engineering, (2016), 244–261. https://doi.org/10.1007/978-3-319-38791-8_14
    [6] G. Cheng, D. Xu, Y. Qu, Summarizing entity descriptions for effective and efficient human-centered entity linking, Proceedings of the 24th International Conference on World Wide Web, (2015), 184–194. https://doi.org/10.1145/2736277.2741094
    [7] G. Cheng, D. Xu, Y. Qu, C3d+ p: a summarization method for interactive entity resolution, Web Semantics: Science, Services and Agents on the World Wide Web, 35 (2015), 203–213. https://doi.org/10.1016/j.websem.2015.05.004 doi: 10.1016/j.websem.2015.05.004
    [8] J. Huang, W. Hu, H. Li, Y. Qu, Automated comparative table generation for facilitating human intervention in multi-entity resolution, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, (2018), 585–594.
    [9] K. Gunaratna, A. H. Yazdavar, K. Thirunarayan, A. Sheth, G. Cheng, Relatedness-based multi-entity summarization, Proceedings of the Twenty-national Joint Conference on Artificial Intelligence, (2017), 1060–1066. https://doi.org/10.24963/ijcai.2017/147
    [10] G. Troullinou, H. Kondylakis, K. Stefanidis, D. Plexousakis, Exploring RDFS KBs using summaries, The Semantic Web – ISWC, (2018), 268–284. https://doi.org/10.1007/978-3-030-00671-6_16
    [11] A. Aker, R. Gaizauskas, Generating descriptive multi-document summaries of geo-located entities using entity type models, J. Assoc. Inf. Sci. Tech., 66 (2015), 721–738. https://doi.org/10.1002/asi.23211 doi: 10.1002/asi.23211
    [12] H.Chen, J. Kuo, S. Huang, C. Lin, H. Wung, A summarization system for Chinese news from multiple sources, J. Am. Soc. Inf. Sci. Tech., 54 (2003), 1224–1236. https://doi.org/10.1002/asi.10315 doi: 10.1002/asi.10315
    [13] E. Baralis, L. Cagliero, S. Jabeen, A. Fiori, S. Shah, Multi-document summarization based on the Yago ontology, Expert Syst. Appl. 40 (2013), 6976–6984. https://doi.org/10.1016/j.eswa.2013.06.047
    [14] K. Gunaratna, K. Thirunarayan, A. Sheth, FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, (2015), 116–122. https://doi.org/10.1609/aaai.v29i1.9180
    [15] M. Sydow, M. Pikuła, R. Schenkel, The notion of diversity in graphical entity summarisation on semantic knowledge graphs, J. Intell. Inf. Syst., 41 (2013), 109–149. https://doi.org/10.1007/s10844-013-0239-6 doi: 10.1007/s10844-013-0239-6
    [16] B. Schäfer, P. Ristoski, H. Paulheim, What is special about Bethlehem, Pennsylvania? Identifying unusual facts about DBpedia entities, Proceedings of the ISWC 2015 Posters & Demonstrations Track, 2015.
    [17] N. Yan, S. Hasani, A. Asudeh, C. Li, Generating preview tables for entity graphs, Proceedings of the 2016 International Conference on Management of Data, (2016), 1797–1811. https://doi.org/10.1145/2882903.2915221
    [18] D. Xu, G. Cheng, Y. Qu, Facilitating human intervention in coreference resolution with comparative entity summaries, The Semantic Web: Trends and Challenges, ESWC 2014, Lecture Notes in Computer Science, (2014), 535–549. https://doi.org/10.1007/978-3-319-07443-6_36
    [19] D. Wei, Y. Liu, F. Zhu, L. Zang, W. Zhou, J. Han, et al., ESA: Entity Summarization with Attention, arXiv preprint arXiv: 1905.10625, 2019.
    [20] Q. Liu, G. Cheng, Y. Qu, DeepLENS: Deep Learning for Entity Summarization, arXiv preprint arXiv: 2003.03736, 2020.
    [21] Q. Liu, Y. Chen, G. Cheng, E. Kharlamov, J. Li, Y. Qu, Entity Summarization with User Feedback, ESWC 2020: The Semantic Web, (2020), 376–392. https://doi.org/10.1007/978-3-030-49461-2_22
    [22] A. Chisholm, W. Radford, B. Hachey, Learning to generate one-sentence biographies from Wikidata, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, (2017), 633–642. https://doi.org/10.18653/v1/E17-1060
    [23] R. Lebret, D. Grangier, M. Auli, Neural Text Generation from Structured Data with Application to the Biography Domain, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, (2016), 1203–1213. https://doi.org/10.18653/v1/D16-1128
    [24] P. Vougiouklis, H. Elsahar, L. Kaffee, C. Gravier, F. Laforest, J. Hare, et al., Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, Journal of Web Semantics, 52 (2018), 1–15. https://doi.org/10.1016/j.websem.2018.07.002 doi: 10.1016/j.websem.2018.07.002
    [25] C. Jumel, A. Louis, J. C. K. Cheung, TESA: A Task in Entity Semantic Aggregation for Abstractive Summarization, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, (2020), 8031–8050. https://doi.org/10.18653/v1/2020.emnlp-main.646
    [26] A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, D. Radev, SummEval: Re-evaluating Summarization Evaluation, Transactions of the Association for Computational Linguistics, 9 (2021), 391–409. https://doi.org/10.1162/tacl_a_00373 doi: 10.1162/tacl_a_00373
    [27] E. Zimina, J. Nummenmaa, K. Järvelin, J. Peltonen, K. Stefanidis, H. Hyyrö, GQA: grammatical question answering for RDF data, Semantic Web Challenges: 5th SemWebEval Challenge at ESWC, (2018), 82–97. https://doi.org/10.1007/978-3-030-00072-1_8
    [28] T. Saracevic, Measuring the degree of agreement between searchers, Proceedings of the 47th Annual Meeting of the American Society for Information Science, 21 (1984), 227–230.
    [29] M. Azmy, P. Shi, I. Ilyas, J. Lin, Farewell Freebase: Migrating the SimpleQuestions Dataset to DBpedia, Proceedings of the 27th international conference on computational linguistics (2018), 2093–2103.
    [30] T. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, L. Pintscher, From Freebase to Wikidata: The Great Migration, Proceedings of the 25th International Conference on World Wide Web, (2016), 1419–1428.
    [31] M. Dubey, D. Banerjee, A. Abdelkawi, J. Lehmann, LC-QuAD 2.0: A Large Dataset for Complex Question Answering over Wikidata and DBpedia, International Semantic Web Conference, (2019), 69–78. https://doi.org/10.1007/978-3-030-30796-7_5
    [32] M. Damova, D. Dannélls, R. Enache, M. Mateva, A. Ranta, Multilingual Natural Language Interaction with Semantic Web Knowledge Bases and Linked Open Data, in Towards the Multilingual Semantic Web: Principles, Methods and Applications, Buitelaar, P., Cimiano, P., Eds., Springer Berlin Heidelberg, (2014), 211–226. https://doi.org/10.1007/978-3-662-43585-4_13
    [33] D. Dannélls, Multilingual text generation from structured formal representations. PhD Thesis. University of Gothenburg, 2012.
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)