
Information obtained from different sources is commonly referred to as multi-modal information. Much current research in this field aims to combine multi-modal information and exploit the strengths of each modality in real-world applications, giving rise to new cross-modal research areas [1,2,3,4]. The demand for content-based video retrieval has increased with the widespread use of video platforms such as TikTok and YouTube, and cross-modal retrieval [5,6,7] has therefore gained significant attention from scholars in recent years.
However, text and video convey information in markedly different ways. Text conveys information through words or phrases, whereas a video carries a much broader range of content, so a sentence typically describes only part of what appears in a video [8]. When matching sentences and videos, a more precise approach may therefore be necessary because videos contain redundant frames. For instance, Figure 1 shows two sample videos and their matching text from the MSR-VTT dataset [9]. Some video frames do not match the semantic meaning of the sentences and are redundant components in the text-video matching procedure [10], which introduces bias into the matching results.
To enhance the precision of text-video retrieval, a proven practical approach is to eliminate redundant frames and focus the analysis on video subintervals that are semantically linked to the text, as demonstrated by Dong et al. [11]. Consequently, the matching model must be able to extract critical textual content and correctly identify the corresponding video segments. However, prevalent methods commonly resort to mean pooling or self-attention mechanisms [6,12,13] to derive global feature embeddings for a given sentence and video. These methods exhibit limitations in defining and precisely localizing keyframes. This limitation is significant, as it can degrade retrieval performance, particularly when videos contain content that is not explicitly described in the accompanying text.
This paper presents a novel cross-modal conditional mechanism designed to address two critical issues: feature redundancy within videos and the imbalanced semantic alignment between the two modalities. To tackle these challenges, we propose two key components:
Firstly, we introduce a video feature aggregation module that leverages a text-conditioned attention mechanism. The text features serve as a guiding condition, strengthening the emphasis on keyframes while diminishing the influence of redundant frames. The aggregation step then weights each frame feature by its attention score and sums the results, yielding refined video features.
Secondly, we introduce a global-local cross-modal conditional similarity calculation module. This module considers video-sentence features as the global data and frame-word features as the local data. These features are input into the model for similarity calculation, a critical step for text-video matching. This approach enables us to effectively address the overarching topic and the finer details in the matching process.
Text Feature Encoding Previous studies [14,15,16,17] have examined the extraction of text features and achieved exceptional outcomes. The Skip-Gram model [18] starts from the central word and predicts its surrounding words, producing text features. This approach not only cuts down the computational effort during the training phase but also elevates the quality of the word-to-vector representation. To enhance the matching between text and images, the Fisher Vector model [19] examines text representation and quantifies it using high-level statistical features for text feature extraction [20]. Furthermore, the GRU model [21] was introduced to address the vanishing-gradient problem of standard RNNs during text encoding; as a result, the GRU has become a widely used text encoder. OpenAI has also released CLIP, a contrastive language-image pre-training model with a transformer structure [22], which has delivered similarly remarkable results in text encoding.
Visual Feature Encoding Visual feature extraction is typically carried out using supervised or self-supervised research methods. Recently, there has been growing interest in using a transformer-based image encoder known as the ViT model [23]. While the application of transformers to feature extraction of video content [24,25] is still in its early stages, it has shown potential for enhancing action classification in video text retrieval. Researchers have been exploring new and innovative approaches to enable models with better generalization capabilities [22,49]. Text and video pairs obtained from the internet are collected and formed into large-scale datasets for training. One of the most successful methods is the CLIP model [22], which has achieved state-of-the-art performance in image feature extraction. The pre-trained CLIP model can learn more sophisticated visual concepts and use these features in retrieval tasks. To mitigate the impact of diverse topics in the dataset, a MIL-NCE model [26] based on the CLIP video encoder has been proposed and tested with positive results on the Howto100M [13] dataset. Furthermore, the ClipBERT [49] model, which is based on the MIL-NCE model, employs an end-to-end approach to streamline the pre-processing stage of video-text retrieval. This paper uses a pre-trained CLIP-based ViT model as our video encoder to extract visual features from the video frames. The effectiveness of the feature extraction has been verified through experimental evaluation.
In cross-modal retrieval, text-video matching plays a key role in bridging vision and language. Text-video matching aims to learn a cross-modal similarity function [27] between text and video, so that related text-video pairs receive higher scores than unrelated ones. Establishing a semantic similarity model that effectively reduces the semantic gap between visual and textual information is crucial for the accuracy of this study [28]. Because of the complex matching patterns and the large semantic differences between images and texts, this remains a challenging research topic. A common approach is to map images and texts into a shared semantic space through a suitable embedding model, i.e., a joint latent space, and then compute cross-modal similarity in this shared space.
Text-video retrieval is typically achieved by integrating a pre-trained language model with a visual model to associate text features with visual features. When dealing with small datasets, incorporating a pre-trained model can improve performance. For instance, the Teachtext model [23] uses multiple text encoders to provide a complementary supervised signal for the retrieval model. MMT [29] and MDMMT [30] were early examples of using transformers for multi-modal video processing, integrating three modal features to accomplish the video retrieval task.
Additionally, some scholars have applied concepts from the data hashing field to tasks involving cross-modal data processing and information retrieval. The ROHLSE model [31] focuses on addressing label noise and exploiting semantic correlations in processing large-scale streaming data. This work presents an innovative approach for hashing streaming data. The DAZSH model [32] introduces a hashing method tailored to the zero-shot problem in cross-modal retrieval. Integrating data features with class attributes effectively captures relationships between known and unknown categories, facilitating the transfer of supervised knowledge. Moreover, a neural network-based approach [33] is designed to learn specific category centers and guide the hashing of multimedia data. Finally, the SKDCH model [34] proposes a semi-supervised knowledge distillation method for cross-modal hashing. It mitigates heterogeneity gaps and enhances discriminability by improving the triplet ranking loss. These studies collectively demonstrate the application of data hashing principles to tackle complex challenges in cross-modal data processing and information retrieval.
Recently, the CLIP model [22] utilized a rich text-image dataset to create a joint text-visual model, which the authors of the CLIP4CLIP model [6] leveraged through transfer learning to achieve state-of-the-art results in video retrieval tasks. In several studies based on the CLIP model [35], the model outperformed most other works [2,12,36], even in a zero-shot manner, showcasing its excellent generalization capabilities in text-video understanding.
Several video feature aggregation methods, including average pooling, self-attention, and multi-modal transformers [4,6], are commonly used in CLIP-based studies and have been shown to match text and images effectively. However, relatively little research has focused specifically on matching video sub-regions with words [49]. As noted in the previous section, many video frames are semantically irrelevant to the text during matching. Using a cross-modal conditional attention mechanism to reduce the impact of redundant frames on retrieval results is therefore the motivation behind this paper's research.
In natural language processing, attention mechanisms are widely used to filter redundant information [37]. Similarly, attention mechanisms have been used to enhance the focus on visual and textual local features in cross-modal information-matching tasks. Some researchers have proposed a similarity attention filtration (SAF) module [38] based on attention mechanisms to match images with text. This module applies attention mechanisms to cross-modal feature alignment, aiming to eliminate the interference caused by redundant text-image pairs and enhance image retrieval accuracy. Owing to the remarkable performance of attention mechanisms in the cross-modal domain, certain researchers [39] have developed more intricate bidirectional focused attention networks, building upon this foundation to enhance matching accuracy further. Concurrently, other scholars [40,41] have introduced a recurrent attention mechanism to investigate the correspondence between fine-grained text regions and individual words.
The crucial aspect of implementing the attention model in text-video cross-modal inference lies in embedding the features of both text and video and subsequently identifying frames that align more effectively with text semantics, as demonstrated by Tan et al. [28]. We have incorporated a textual conditional attention module into our cross-modal matching model to achieve this. This module filters out extraneous semantic information within the frames by computing attention weights for each frame, using text semantics as a conditional projection.
Text-video retrieval comprises two tasks. The first, named t2v, retrieves semantically similar videos given a sentence as the input query; the second, named v2t, retrieves semantically close text given a video as the input query. Taking the t2v task as an example, the input data are a query text and a set of candidate videos; the model calculates the similarity score between the query text and each video in the set and returns the video that best matches the query semantically. The v2t task is defined analogously. This paper focuses mainly on the t2v task. We are dedicated to enhancing the accuracy of text-video retrieval by implementing two pivotal strategies: filtering out irrelevant frames and aggregating keyframes to construct video features, followed by a global-local multi-modal feature matching approach.
Figure 2 illustrates the framework of our model for the text-video retrieval task. The text-video retrieval task is quantified into three main components: Data Embedding, Cross-modal Feature Extraction, and Similarity Calculation. In the Data Embedding phase, we feed the input data (including words and frames) into the text encoder ψ and the image encoder ϕ of the CLIP model, obtaining embedded data representations. The Cross-modal Feature Extraction section encompasses two critical steps. Firstly, we employ a self-attention mechanism to extract sentence features from the text. Secondly, we utilize a conditional attention mechanism to filter out redundant and aggregate frames semantically relevant to the text, thereby obtaining more precise video features. In the Similarity Calculation phase, we compute similarity at global and local granularities (i.e., video-sentence and frame-word features) to consider thematic and detail features during the text-video matching process. It is worth noting that the Cross-modal Feature Extraction and Similarity Calculation sections contain two innovative modules introduced in this paper, which are detailed as follows:
Cross-modal Conditional Attention Aggregation Module To process text input t, we pass it through a text encoder ψ to obtain its word embedding Ew. This embedding is then multiplied with the weight matrix query projection WQ, to produce the text query vector Qt. For video input v, it is passed through a video encoder to produce frame embedding Ef. This embedding is then multiplied with the key projection matrix WK and the value projection matrix WV, respectively, to obtain the key embedding of the frames Kv, and the value embedding of the frames Vv. Then we calculate the attention score of the video frames watt, by taking the dot product of Qt and Kv. The attention scores are used to weight the value vectors of the video frames Vv, to produce the self-attention frame feature embedding.
Global-Local Similarity Matching Module This module performs the text-video matching task through a cross-similarity calculation. It integrates cosine similarity and conditional probability models to compute similarity scores between the feature embeddings of the two modalities, taking their mutual dependence into account. The global feature data (text-video) and local feature data (word-frame) are fed into the module separately, producing their respective similarity scores. The module then aggregates the global and local similarity scores through self-attention to obtain the final matching scores.
In this section, we concentrate on the methods for implementing the proposed model. To facilitate a comprehensive understanding, we first describe how the pre-trained CLIP model is used to encode text and video in Section 4.1. The following two sections introduce the pivotal functional components of our model: the Cross-modal Conditional Attention Aggregation Module (Section 4.2) and the Global-Local Similarity Matching Module (Section 4.3). Section 4.2 describes how attention mechanisms are incorporated into cross-modal feature aggregation to make the video features more relevant to the text semantics. In Section 4.3, we highlight the limitations of traditional similarity computation methods for cross-modal feature matching and propose a novel method for computing the global-local similarity of correlated cross-modal features. Finally, we present the training objectives that combine both modules in Section 4.4.
A video can be considered a sequence of images, with each video frame being an individual image. Many pre-trained models have been shown to extract features from text and images effectively, enabling cross-modal semantic understanding [6,22]. These models have been pre-trained on large and diverse datasets, allowing us to leverage their feature extraction performance to simplify the training process of our work.
CLIP models trained on large, richly typed datasets have demonstrated exceptional feature extraction abilities and robust performance in downstream tasks. Numerous studies have shown that CLIP performs well in extracting the rich semantic features of input information [22]. In the task of video feature extraction, individual video frames are embedded in CLIP's joint latent space as images. The video features are obtained by aggregating the embedded features of the individual frames. In this paper, we learn a new joint latent space based on the CLIP model to serve as an encoder for our standard video-text feature extraction.
Given text t and video v as inputs, we first preprocess the video into quantifiable frames vfn and feed these frames into the CLIP model as images. CLIP then outputs a text embedding Et and frame embeddings Efnv as encodings. By aggregating the set of frame embeddings SetF, we can obtain the video embedding EV:
$E_t = \psi(t) \in \mathbb{R}^{d}$  (4.1)
$E^{v}_{f_n} = \phi(v_{f_n}) \in \mathbb{R}^{d}$  (4.2)
$Set_F = [E^{v}_{f_1}, E^{v}_{f_2}, \cdots, E^{v}_{f_n}] \in \mathbb{R}^{n \times d}$  (4.3)
where ψ is CLIP's text encoder, and ϕ is CLIP's image encoder. SetF is the set of frames feature embedding.
Then we can obtain the video's feature embedding by a temporal aggregation function ρ:
$E_V = \rho(Set_F)$  (4.4)
Here, Et and Efnv are the outputs of CLIP's text and image encoders, respectively.
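As an illustration of this encoding step, the sketch below uses PyTorch and the open-source openai/clip package; the package choice and the functions it exposes (clip.load, clip.tokenize, encode_text, encode_image) are assumptions about tooling, not a description of the authors' exact pipeline.

```python
import torch
import clip  # assumed: the open-source OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # psi = text encoder, phi = image encoder

def encode_text_and_frames(sentence, frames_pil):
    """Return E_t (1 x d) and Set_F (n x d) as in Eqs (4.1)-(4.3); d = 512 for ViT-B/32."""
    with torch.no_grad():
        tokens = clip.tokenize([sentence]).to(device)
        E_t = model.encode_text(tokens)                                  # Eq (4.1): psi(t)
        frames = torch.stack([preprocess(f) for f in frames_pil]).to(device)
        Set_F = model.encode_image(frames)                               # Eqs (4.2)-(4.3): phi(v_fn), stacked
    return E_t.float(), Set_F.float()
```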
Previous research has typically aggregated the frame embeddings into a video embedding with average pooling or a self-attention mechanism [12,29]. However, because a sentence carries far less semantic information than a video, the resulting video embedding contains many visual features that are irrelevant to the semantics of the text. Feeding such an embedding into the similarity calculation can therefore degrade the accuracy of the final results.
This module uses an attention mechanism to extract the video features: we use the semantic text features to compute attention weights for the keyframes, which strengthens the crucial information in the frames and filters out redundant information. Firstly, we project the text embedding Et into a query vector $Q_t \in \mathbb{R}^{1 \times d_a}$. The frame embedding set SetF obtained in Section 4.1 is then projected into a key matrix $K_F \in \mathbb{R}^{n \times d_a}$ and a value matrix $V_F \in \mathbb{R}^{n \times d_a}$ through products with the matrices $W_K \in \mathbb{R}^{d \times d_a}$ and $W_V \in \mathbb{R}^{d \times d_a}$, respectively. The calculations are defined as follows:
$Q_t = W_Q \cdot E_t$  (4.5)
$K_F = W_K \cdot Set_F$  (4.6)
$V_F = W_V \cdot Set_F$  (4.7)
where WQ, WK and WV are the parameter matrices obtained from the neural network training.
Finally, by utilizing the cross-modal attention feature aggregation module, we obtain the joint text-video semantic attention scores for each frame, represented as Sfn.
$S_V = [S_{f_1}, S_{f_2}, \cdots, S_{f_n}] = \mathrm{softmax}\left(\frac{Q_t K_F^{T}}{\sqrt{d_a}}\right) V_F$  (4.8)
The above equation is the core of the aggregation function ρ, and the video feature embedding EV can finally be calculated as follows:
$E_V = S_{f_1} E^{v}_{f_1} + S_{f_2} E^{v}_{f_2} + \cdots + S_{f_n} E^{v}_{f_n}$  (4.9)
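A minimal PyTorch sketch of this aggregation module is given below. It follows one reading of Eqs (4.5)-(4.9), applying the text-conditioned attention weights directly to the frame embeddings; the projection dimension d_a and the use of bias-free linear layers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedFrameAggregation(nn.Module):
    """Sketch of Eqs (4.5)-(4.9): aggregate frame embeddings with text-conditioned attention."""
    def __init__(self, d=512, d_a=512):
        super().__init__()
        self.W_Q = nn.Linear(d, d_a, bias=False)  # query projection for the text embedding
        self.W_K = nn.Linear(d, d_a, bias=False)  # key projection for the frame embeddings
        self.W_V = nn.Linear(d, d_a, bias=False)  # value projection for the frame embeddings
        self.scale = d_a ** 0.5

    def forward(self, E_t, Set_F):
        # E_t: (1, d) text embedding; Set_F: (n, d) frame embeddings
        Q_t = self.W_Q(E_t)                                   # Eq (4.5)
        K_F = self.W_K(Set_F)                                 # Eq (4.6)
        V_F = self.W_V(Set_F)                                 # Eq (4.7), kept for completeness
        S_V = F.softmax(Q_t @ K_F.t() / self.scale, dim=-1)   # Eq (4.8): (1, n) per-frame attention weights
        E_V = S_V @ Set_F                                     # Eq (4.9): weighted sum of frame embeddings
        return E_V, S_V
```

The returned S_V corresponds to the per-frame weights that are visualized later in Figures 3 and 4.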
In Section 4.1, the CLIP encoder produces the text feature embedding Et and the set of frame feature embeddings SetF. Section 4.2 then leverages the attention mechanism to aggregate the frame embeddings into the text-conditional video embedding EV. Although this approach incorporates semantic text features into the video embedding, conventional similarity computation models such as cosine similarity can only improve the matching accuracy to a certain extent and may still overlook the local semantics expressed in specific keyframes. To address this issue, this section considers both the global video-sentence structure and the local frame-word features in semantic expression, and combines the similarity computations of videos and sentences to perform text-video matching.
Vector Similarity Function The previous methods of calculating the similarity between features of two different modal data often relied on cosine or Euclidean distance [40]. While these methods can capture relevance to a degree, they cannot detect finer local correspondences between the vectors. Our proposed similarity representation function aims to address this issue by leveraging the local features of the vectors and using cosine similarity calculation as the core component. This enables a more in-depth analysis of the correlation information between the feature representations from different modalities. The similarity function is formulated as follows:
$f(\alpha_1, \alpha_2; W_{sim}) = \frac{W_{sim}\,|\alpha_1 - \alpha_2|^{2}}{\left\| W_{sim}\,|\alpha_1 - \alpha_2|^{2} \right\|_{2}}$  (4.10)
where $|\alpha_1 - \alpha_2|^{2}$ denotes the element-wise square of $\alpha_1 - \alpha_2$, and $\left\| W_{sim}\,|\alpha_1 - \alpha_2|^{2} \right\|_{2}$ is the $\ell_2$ norm of $W_{sim}\,|\alpha_1 - \alpha_2|^{2}$. $W_{sim}$ is a learnable parameter matrix used to obtain the similarity vector.
Text-Video Global Similarity Calculation Following the similarity function in Eq (4.10), we replace $\alpha_1$ and $\alpha_2$ with the video feature embedding $E_V$ and the text feature embedding $E_t$, respectively.
$Sim_g = f(E_V, E_t; W_g) = \frac{W_g\,|E_V - E_t|^{2}}{\left\| W_g\,|E_V - E_t|^{2} \right\|_{2}}$  (4.11)
where Wg is the parameter matrix that aims to learn the global similarity through training.
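A sketch of this learned similarity vector, Eqs (4.10)-(4.11), is shown below; the output dimension d_sim and the use of an nn.Linear layer for W_sim are assumptions, and the mapping from the similarity vector to a scalar matching score (aggregated with the local scores, as described in the framework overview) is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityVector(nn.Module):
    """Sketch of Eq (4.10): learned similarity representation between two feature vectors."""
    def __init__(self, d=512, d_sim=256):
        super().__init__()
        self.W_sim = nn.Linear(d, d_sim, bias=False)   # learnable W_sim

    def forward(self, a1, a2):
        diff_sq = (a1 - a2).pow(2)                     # element-wise |a1 - a2|^2
        s = self.W_sim(diff_sq)                        # W_sim |a1 - a2|^2
        return F.normalize(s, p=2, dim=-1)             # divide by its l2 norm

# Global text-video similarity, Eq (4.11): the same form with its own parameters W_g.
global_sim = SimilarityVector()
# sim_g = global_sim(E_V, E_t)
```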
Frame-Text Local Similarity Calculation To exploit the local semantic information in frames, we propose a similarity calculation between the video's individual frames and the words of the sentence.
First, we obtain the cosine similarity Cij of the frame feature vector vi and the word vector tj:
$C_{ij} = \frac{v_i^{T} t_j}{\left\| v_i \right\| \left\| t_j \right\|}$  (4.12)
The cosine similarities are then thresholded at zero and normalized to obtain the local feature weights $\beta_{ij}$.
$\beta_{ij} = \frac{\max(0, C_{ij})}{\sqrt{\sum_{i=1}^{n} \max(0, C_{ij})^{2}}}$  (4.13)
After obtaining the attention weights, we calculate the frame feature representation that carries the words' semantic information:
$V_{f_i} = \sum_{i=1}^{n} \beta_{ij} v_i$  (4.14)
Finally, we compute the frame-text local similarity representation between Vfi and tj using Eq (4.10):
$sim_{l_j} = f(V_{f_i}, t_j; W_l)$  (4.15)
where $W_l$ is a learnable parameter matrix analogous to $W_g$.
The local similarity captures the association between a specific word and the frames that make up the video, using finer-grained visual-semantic alignment to improve similarity prediction.
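The sketch below implements one reading of Eqs (4.12)-(4.15) in PyTorch, aggregating frames per word before applying the learned similarity function; it reuses the SimilarityVector module from the previous sketch, and the small epsilon for numerical stability is our own addition.

```python
import torch
import torch.nn.functional as F

def local_frame_word_similarity(frame_feats, word_feats, sim_module):
    """frame_feats: (n, d) frame vectors v_i; word_feats: (m, d) word vectors t_j;
    sim_module: the learned similarity function f(., .; W_l) of Eq (4.10)."""
    # Eq (4.12): cosine similarity C_ij between every frame and every word
    C = F.normalize(frame_feats, dim=-1) @ F.normalize(word_feats, dim=-1).t()    # (n, m)
    # Eq (4.13): threshold at zero and normalize over frames to obtain the weights beta_ij
    C_pos = C.clamp(min=0)
    beta = C_pos / (C_pos.pow(2).sum(dim=0, keepdim=True).sqrt() + 1e-8)           # (n, m)
    # Eq (4.14): for each word, aggregate the frames weighted by beta_ij
    V_f = beta.t() @ frame_feats                                                    # (m, d)
    # Eq (4.15): local similarity between each attended frame representation and its word
    return sim_module(V_f, word_feats)                                              # (m, d_sim)
```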
We take the widely used ranking loss function [42] as the training objective in our cross-modal retrieval task. Its goal is to evaluate the relative distance between input samples and optimize model training by incorporating the similarity calculation results into the ranking loss. The similarity computation model is defined as sim(), with positive samples (V,T) being the matched video-text pairs and the negative samples being mismatched pairs:
$V' = \arg\max_{r \neq V} sim(r, T)$  (4.16)
$T' = \arg\max_{w \neq T} sim(V, w)$  (4.17)
The loss follows the ranking loss formulation:
$Loss = \omega_1 Loss_{loc} + \omega_2 Loss_{glo}$  (4.18)
where:
$Loss_{loc}(v_a, v_p, v_n) = \sum \max\left(0,\; s(v_a, v_n) - s(v_a, v_p) + \alpha\right)$  (4.19)
$Loss_{glo}(v_a, v_p, v_n) = \max\left(0,\; s(v_a, v_n) - s(v_a, v_p) + \alpha\right)$  (4.20)
where $v_a$ is the anchor sample, i.e., the reference vector; $v_p$ is the sample $V$ or $T$ that matches the anchor; and $v_n$ is the sample $V'$ or $T'$ that does not. The vector arguments of $Loss_{loc}$ are the frame and word local feature vectors, while the vector arguments of $Loss_{glo}$ are the video and sentence global feature vectors.
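A hedged sketch of this ranking objective with hardest in-batch negatives (Eqs (4.16)-(4.20)) is given below; the margin value and the in-batch negative mining are assumptions about training details not spelled out in the text.

```python
import torch

def hard_negative_ranking_loss(sim_matrix, margin=0.2):
    """sim_matrix: (B, B) scores s(V_i, T_j) for a batch; the diagonal holds the matched pairs."""
    B = sim_matrix.size(0)
    pos = sim_matrix.diag().view(B, 1)                        # s(v_a, v_p)
    mask = torch.eye(B, dtype=torch.bool, device=sim_matrix.device)
    masked = sim_matrix.masked_fill(mask, float("-inf"))
    hardest_t = masked.max(dim=1).values.view(B, 1)           # T' = argmax_{w != T} sim(V, w), Eq (4.17)
    hardest_v = masked.max(dim=0).values.view(B, 1)           # V' = argmax_{r != V} sim(r, T), Eq (4.16)
    loss_t = torch.clamp(margin + hardest_t - pos, min=0)     # hinge term over text negatives
    loss_v = torch.clamp(margin + hardest_v - pos, min=0)     # hinge term over video negatives
    return (loss_t + loss_v).mean()

# Final objective, Eq (4.18): loss = w1 * loss_from_local_scores + w2 * loss_from_global_scores
```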
To validate the effectiveness of our model, this section reports experiments on four widely used text-video retrieval datasets: MSR-VTT [9], LSMDC [44], MSVD [43] and DiDeMo [12]. The model is evaluated in terms of different recall rates and ranking results, and its performance is compared with the experimental results of existing studies.
MSR-VTT dataset was created by collecting 257 popular video queries from a commercial search engine, with 118 videos for each query. The current version of MSR-VTT offers 10,000 web video clips, totaling 41.2 hours and 200,000 clip-sentence pairs, and each video is annotated with approximately 20 captions. For comparability with previous work, 7000 videos were selected for training [13] and 1000 videos for testing [43], following the split commonly used in current studies. Since no validation set is provided, 1000 videos were randomly selected from MSR-VTT to form the validation set.
LSMDC dataset comprises 118,081 video clips extracted from 202 movies, ranging from two to 30 seconds. The validation set includes 7,408 clips, and the evaluation is performed on a separate test set consisting of 1000 videos from movies that are distinct from those in the training and validation sets.
MSVD dataset comprises 1970 videos ranging from 10 to 25 seconds, and each video is annotated with 40 captions. The videos feature various subjects, including people, animals, actions, and scenes. Each video was annotated by multiple annotators, with approximately 41 annotated sentences per clip and a total of 80,839 sentences. The standard splitting [6] was used, with 1,200 videos for training, 100 videos for validation, and 670 videos for testing.
DiDeMo dataset comprises 10,000 Flickr videos annotated with roughly 40,000 sentences in total, with 1000 videos in the test set. Following prior work, we assess paragraph-to-video retrieval, in which all sentence descriptions of a video are concatenated to form a single query. Notably, this dataset includes localization annotations (ground-truth proposals), and our reported results incorporate these ground-truth proposals.
Data Pre-processing. Different datasets have varying video durations and frame sizes, making it challenging to standardize the model input format. To resolve this issue, this study extracts 12 frames from each video according to a specified time window and uses them as representatives of the video content, ensuring a uniform input shape for the model. Additionally, to ensure consistency with previous work [2,6,12] and facilitate testing, the pixel size of each video frame was adjusted to 224×224.
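A sketch of this sampling step is shown below, assuming OpenCV for video decoding and uniform sampling over the video's duration as the "specified time window"; the resulting PIL images would later pass through CLIP's own preprocessing.

```python
import cv2            # assumed: OpenCV for video decoding; any reader with frame indexing works
import numpy as np
from PIL import Image

def sample_frames(video_path, num_frames=12, size=224):
    """Uniformly sample num_frames frames and resize them to size x size."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), num_frames).astype(int)   # evenly spaced frame indices
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)                 # OpenCV decodes as BGR
        frames.append(Image.fromarray(frame).resize((size, size)))
    cap.release()
    return frames   # list of PIL images
```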
Model Settings. The study employs the CLIP model as its backbone and initializes all encoder parameters from the pre-trained CLIP weights, as described in [22]. For each video, the ViT-B/32 image encoder of CLIP is used to obtain the frame embeddings, while CLIP's transformer text encoder is used to obtain the text embeddings. The CLIP encoder has an output size of 512, which also sets the dimension of the three attention projections to 512. The weight matrices Wq, Wk, and Wv are randomly initialized, and the bias values are set to 0. The output units of the fully connected layer are also set to 512, and a dropout of 0.3 is applied, as described in [45]. The study employs the Adam optimizer [46] for training, with an initial learning rate of 0.00002, and the learning rate is decayed using a cosine schedule, as described in [22].
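The optimizer setup described above can be sketched as follows; the placeholder model and the number of training steps are hypothetical, and CosineAnnealingLR is only one way to realize the cosine decay mentioned in the text.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Sequential(nn.Linear(512, 512), nn.Dropout(p=0.3))       # placeholder network with dropout 0.3
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)           # Adam with initial learning rate 0.00002
num_training_steps = 10_000                                         # hypothetical; depends on dataset and epochs
scheduler = CosineAnnealingLR(optimizer, T_max=num_training_steps)  # cosine learning-rate decay

# In the training loop, call scheduler.step() after each optimizer.step().
```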
Recall at K [12,29] measures the proportion of queries for which the ground-truth item appears among the top K retrieved results. Recall at 1 (R@1), recall at 5 (R@5), and recall at 10 (R@10) were used as evaluation metrics during testing.
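For reference, the sketch below computes R@K, together with the median and mean rank reported in Tables 2 and 3, from a text-to-video similarity matrix in which the ground-truth video of query i sits at column i; this indexing convention is an assumption.

```python
import numpy as np

def retrieval_metrics(sim_matrix, ks=(1, 5, 10)):
    """sim_matrix: (num_texts, num_videos) with the ground truth of text i at column i."""
    ranks = []
    for i, row in enumerate(np.asarray(sim_matrix)):
        order = np.argsort(-row)                         # videos sorted by descending similarity
        ranks.append(int(np.where(order == i)[0][0]))    # 0-based rank of the ground-truth video
    ranks = np.array(ranks)
    metrics = {f"R@{k}": float(np.mean(ranks < k) * 100) for k in ks}
    metrics["MdR"] = float(np.median(ranks) + 1)         # median rank (1-based)
    metrics["MnR"] = float(np.mean(ranks) + 1)           # mean rank (1-based)
    return metrics
```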
In this section, we present the results of the retrieval performance of our model on the MSR-VTT, LSMDC, MSVD and DiDeMo datasets. The aim is to showcase the superiority of our model in comparison to other existing models.
Table 1 presents the results of comparative experiments in text-to-video retrieval (R@1/5/10) across four widely utilized public datasets.
Method | MSR-VTT R@1 | MSR-VTT R@5 | MSR-VTT R@10 | LSMDC R@1 | LSMDC R@5 | LSMDC R@10 | MSVD R@1 | MSVD R@5 | MSVD R@10 | DiDeMo R@1 | DiDeMo R@5 | DiDeMo R@10
CE [22] | 20.9 | 48.8 | 62.4 | 11.2 | 26.9 | 34.8 | 19.8 | 49 | 63.8 | 16.1 | 41.1 | - |
MMT [29] | 26.6 | 57.1 | 69.6 | 12.9 | 29.9 | 40.1 | - | - | - | - | - | - |
Frozen [12] | 31 | 59.5 | 70.5 | 15.0 | 30.8 | 39.8 | 33.7 | 64.7 | 76.3 | 34.6 | 65.0 | 74.7 |
HIT-pretrained [47] | 30.7 | 60.9 | 73.2 | 14.0 | 31.2 | 41.6 | - | - | - | - | - | - |
MDMMT [30] | 38.9 | 69.0 | 79.7 | 18.8 | 39.5 | 47.9 | - | - | - | - | - | - |
All-in-one [48] | 37.9 | 68.1 | 77.1 | - | - | - | - | - | - | 32.7 | 61.4 | 73.5 |
ClipBERT [49] | 22.0 | 46.8 | 59.9 | - | - | - | - | - | - | 20.4 | 48.0 | 60.8 |
CLIP-straight [35] | 31.2 | 53.7 | 64.2 | 11.2 | 22.7 | 29.2 | 37 | 64.1 | 73.8 | - | - | - |
CLIP2Video [50] | 30.9 | 55.4 | 66.8 | - | - | - | 47.0 | 76.8 | 85.9 | - | - | - |
Singularity [51] | 42.7 | 69.5 | 78.1 | - | - | - | - | - | - | 53.1 | 79.9 | 88.1 |
LAVENDER [52] | 40.7 | 66.9 | 77.6 | 26.1 | 46.4 | 57.3 | 46.3 | 76.9 | 86.0 | 53.4 | 78.6 | 85.3 |
CLIP4Clip-meanP [53] | 43.1 | 70.4 | 80.8 | 20.7 | 38.9 | 47.2 | 46.2 | 76.1 | 84.6 | 43.4 | 70.2 | 80.6 |
CLIP4Clip-seqTransf [53] | 44.5 | 71.4 | 81.6 | 22.6 | 41.0 | 49.1 | 45.2 | 75.5 | 84.3 | 42.8 | 68.5 | 79.2 |
VINDLU [54] | 46.5 | 71.5 | 80.4 | - | - | - | - | - | - | 61.2 | 85.8 | 91.0 |
ours | 45.3 | 72.5 | 81.3 | 26.5 | 47.1 | 57.4 | 47.6 | 77.2 | 86.0 | 60.7 | 86.1 | 92.2 |
Comparing our method's results with existing approaches, we observe that on the MSR-VTT, LSMDC, MSVD, and DiDeMo datasets, our average recall across R@1/5/10 is 66.4% (+0.3%), 43.7% (+0.4%), 70.3% (+0.2%) and 79.7% (+0.4%), respectively. These scores surpass the models listed in the table on all four datasets, validating the effectiveness of the approach presented in this paper.
More precisely, on the MSR-VTT and DiDeMo datasets, our model's R@1 results are lower than those of the VINDLU model. Upon analysis, the VINDLU model focuses on effective video-and-language pretraining, using the jointly trained CC3M + WebVid2M dataset, whose content domains (such as sports, news, and human actions) are closely aligned with MSR-VTT. Consequently, VINDLU outperforms our model on the R@1 metric. However, because our model better captures video themes and details, its overall performance exceeds VINDLU on the R@5 and R@10 metrics.
Additionally, it is worth noting that only on the MSR-VTT dataset is the R@10 result of the CLIP4Clip-seqTransf model slightly higher than ours; on all other datasets and metrics, our model outperforms CLIP4Clip-seqTransf. Our model can therefore be considered more stable than CLIP4Clip-seqTransf. Since both methods use CLIP as the backbone, we attribute the improvement to the fact that CLIP4Clip-seqTransf extracts visual features in a text-agnostic way, whereas our model aggregates frame features conditioned on text semantics.
Furthermore, on the LSMDC dataset, the retrieval task is more challenging due to the inherently vague textual descriptions of movie scenes. This conclusion can be drawn from the generally lower retrieval scores achieved by previous methods on this dataset. However, our approach outperforms the models listed in the table across all metrics. This demonstrates the significance of our model's ability to aggregate video features conditioned on text semantics. It learns the features of frames most relevant to the text semantics and suppresses the interference of redundant frames in feature aggregation.
In this section, a series of ablation experiments are conducted to explore the effects of the two modules and understand the model's advantages.
Module 1. The embedding module for video feature acquisition, which utilizes a cross-modal attention mechanism to aggregate frame features.
Module 2. The global-local similarity-based computation module.
The comparison experiments were performed on the MSR-VTT dataset.
Cross-modal Feature Aggregation
Table 2 presents the results of the ablation study on the cross-modal feature aggregation module for video feature extraction. The different configurations for the ablation experiments are shown in the table.
Test Model | Mean | Self-Att | Cross-modal | R@1 | R@5 | R@10 | MdR | MnR
1 | ✓ | | | 41.8 | 70.9 | 83.5 | 3.0 | 13.7
2 | | ✓ | | 45.3 | 74.5 | 84.7 | 2.0 | 12.3
3 | | | ✓ | 47.6 | 77.2 | 86.0 | 2.0 | 10.0
In this set of experiments, we compare the performance of our cross-modal aggregation method with that of Mean Aggregation and Self-attention Aggregation. The Mean Aggregation method calculates an unweighted average of the frame feature embeddings, while the Self-attention Aggregation method computes aggregation weights without utilizing textual semantic information and aggregates the frame features using a focused mechanism.
The results of these experiments, shown in Table 2, reveal an improvement in R@1 of roughly 2 to 6 points over the alternative aggregation methods. This indicates that our cross-modal attention-based approach to acquiring video features captures the relationships between video frames and text semantics more accurately.
Test Model | Local | Global | R@1 | R@5 | R@10 | MdR | MnR
1 | ✓ | | 45.2 | 75.2 | 85.6 | 3.0 | 11.5
2 | | ✓ | 44.8 | 73.7 | 84.5 | 3.0 | 12.2
3 | ✓ | ✓ | 47.6 | 77.2 | 86.0 | 2.0 | 10.0
Global-Local Similarity Calculation
In the ablation experiments of the similarity calculation module, Table 3 shows the impact of different strategies on similarity analysis and score prediction. The results indicate that using only the video features obtained from the cross-modal attention feature aggregation method (Section 4.2) as input to the similarity calculation module slightly decreases performance compared to using frame-word local features. This suggests two things: (1) the aggregation process may lose some detailed features, and (2) the slight performance decrease also implies that the aggregated video features still capture most of the information present in the frames. The combined global-local similarity calculation improves R@1 by about 2 to 3 points compared to using either similarity individually.
Figure 3 displays the attention weights of selected video frames generated by the cross-modal feature aggregation model. As the examples show, the model's attention mechanism can distinguish the relative importance of each frame's content, assigning lower weights to frames with limited correlation to the textual information. In comparison, the self-attention aggregation method can recognize frames with crucial information but fails to differentiate between frames with subtle differences, while the mean weighting aggregation method does not differentiate between frames at all.
The line graph in Figure 4 shows the trend of the weights assigned to the keyframes of the first example in Figure 3. The results demonstrate that the cross-modal attention mechanism effectively identifies the frames relevant to the critical information in the video, assigning them higher weights. The mean aggregation method presents a flat trend, with no significant fluctuations in the weight assignments, while the self-attention method appears less responsive to changes in frame content, leading to a more moderate trend in the graph.
The results in Figure 5 show the effectiveness of the text-to-video model developed in this study. The first row displays the input query text, while the second shows the ground truth. The remaining rows (3–5) present each query's top 1–3 ranked results. The retrieved video frames are visually similar to the ground truth and semantically align with the given text query, demonstrating the ability of the model to match textual and visual information.
The first column in Figure 5 demonstrates the model's aptitude in retrieving videos accurately related to the query text. The query "doing craft" is reflected in the captions of the retrieved videos, all of which pertain to "craft" and feature a "woman". This indicates that the model can efficiently match text and video topics during retrieval. The second column showcases the model's focus on the critical elements shared between the text and video modalities, as the top-ranked retrieval result, despite not being the ground truth, contains the crucial information from the query, namely a "woman" and a "laptop". Similarly, both the top 2 and top 3 ranked videos in the last column depict a "student" and a "teacher" in a "classroom".
The utilization of cross-modal feature aggregation and global-local similarity calculation in the model elevates the accuracy and sophistication of text-to-video retrieval results. This allows the model to concentrate on the topics and visual aspects of the videos, resulting in a more precise and refined retrieval outcome.
This paper improves the performance of text-video matching by introducing two modules: the cross-modal attention feature aggregation module and the global-local similarity calculation module. The cross-modal attention feature aggregation module leverages the pre-trained CLIP model's multi-modal feature extraction capabilities to extract highly relevant video features, focusing on the frames most pertinent to the text. Meanwhile, the global-local similarity calculation module computes similarities at the video-sentence and frame-word granularities, allowing a more nuanced consideration of both topic and detail features in the matching process. The experimental results on the benchmark datasets clearly demonstrate the efficacy of the proposed modules in capturing both topic and detail features, leading to improvements in text-video matching accuracy. This work contributes to multi-modal representation learning, highlighting the potential of advanced feature aggregation and similarity calculation techniques for enhancing text-video matching. Further research may be needed to fully realize our methods in real-world applications.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
The authors would like to acknowledge the support provided by Aerospace HongKa Intelligent Technology (Beijing) CO., LTD.
The authors declare there is no conflict of interest.