Research article

A multi-scale method for complex flows of non-Newtonian fluids

  • Received: 27 February 2021 Accepted: 27 October 2021 Published: 19 November 2021
  • We introduce a new heterogeneous multi-scale method for the simulation of flows of non-Newtonian fluids in general geometries and present its application to paradigmatic two-dimensional flows of polymeric fluids. Our method combines micro-scale data from non-equilibrium molecular dynamics (NEMD) with macro-scale continuum equations to achieve a data-driven prediction of complex flows. At the continuum level, the method is model-free, since the Cauchy stress tensor is determined locally in space and time from NEMD data. The modelling effort is thus limited to the identification of suitable interaction potentials at the micro-scale. Compared to previous proposals, our approach takes into account the fact that the material response can depend strongly on the local flow type and we show that this is a necessary feature to correctly capture the macroscopic dynamics. In particular, we highlight the importance of extensional rheology in simulating generic flows of polymeric fluids.

    Citation: Francesca Tedeschi, Giulio G. Giusteri, Leonid Yelash, Mária Lukáčová-Medvid'ová. A multi-scale method for complex flows of non-Newtonian fluids[J]. Mathematics in Engineering, 2022, 4(6): 1-22. doi: 10.3934/mine.2022050




    In human-human dialogue systems, particularly in scenarios such as speeches, co-speech gestures serve as a crucial means for speakers to convey their intentions through non-verbal behavior [1,2,3]. Psycholinguistic studies indicate that natural body movements, such as arm waving, nodding, and shaking the head, enrich the speaker's viewpoints and foster interactive communication with the audience [1,4]. With the advancements in Artificial Intelligence Generated Content (AIGC), producing natural and diverse co-speech gestures has become one of the key challenges in current generative tasks. Particularly in virtual characters for games and films, creating expressive gestures significantly enhances the experience for players and audiences [5].

    Early research on co-speech gesture generation relied on rule-based methods [6,7]. Gesture generation systems built by meticulously designing correspondences between speech and gesture units can produce high-quality gestures, but these gestures lack diversity and require significant manual effort. With the progress of deep learning models, co-speech gesture generation has shifted towards data-driven approaches. Figure 1 shows a schematic diagram of the synthesized gestures. Researchers [8,9,10,11] have used adversarial training [12] to synthesize gestures, achieving impressive results. However, maintaining a balance between the generator and discriminator is difficult, often resulting in unstable training and mode collapse. Diffusion models [13] have gained widespread attention for their outstanding performance in generative tasks. In the many-to-many mapping scenario of co-speech gesture generation, a diffusion model can learn and approximate complex distributions. Therefore, we employ a latent diffusion model to reduce irregular human motion in audio-driven motion synthesis, resulting in high-quality and diverse co-speech gestures.

    Figure 1.  Visualization of synthesized co-speech gestures.

    These methods [8,9,10] utilize multimodal inputs, such as audio, text, and speaker identity, to train generative models for synthesizing gestures. However, they have not fully explored the impact of useful audio information on gesture synthesis. Both raw audio waveforms and Mel-spectrograms contain rich audio information. Previous work [8,14,15,16] has extracted features from raw audio waveforms only through the decoder's final layer. In contrast, we combine the Mel-spectrogram with the raw audio waveform to achieve more detailed feature extraction across the audio space. In synthesizing long-sequence gestures, early RNN-based models [11,17,18] tend to accumulate errors over time, leading to repetitive and stagnant gestures. Transformer-based models, however, can effectively capture long-term dependencies using positional encoding to retain the sequence order of the input data. Consequently, we utilize a Transformer model [19] to process the fused multimodal data. Additionally, we introduce a cross-dimension attention mechanism to mitigate the redundancy arising from concatenating features of the two audio modalities.

    Our major contributions are as follows:

    1) Using latent diffusion concepts, we establish a powerful Audio to Diffusion Gesture (A2DG) generation pipeline that synthesizes gestures with high quality and diversity. Through extensive comparative experiments and analyses on two public datasets, we demonstrate the superior performance of our method.

    2) To fully explore the joint distribution between audio information and gestures, we propose the Audio Feature Constructor (AFC). It employs a two-stage attention operation to extract features from the Mel-spectrogram, which are then combined with features from the raw audio signal. This approach enhances the model's capacity to learn and utilize relevant audio information.

    3) To eliminate cumulative errors in non-autoregressive tasks, we introduce the Cross Dimension Transformer (CDformer). Additionally, we introduce a cross-dimension attention mechanism that focuses on the spatial and channel dimensions of input modalities, reducing the impact of redundant information on the model.

    The rest of this article is structured as follows: Section 2 reviews work related to co-speech gesture generation. In Section 3, we introduce the proposed Audio to Diffusion Gesture pipeline. Section 4 presents the experiments, and Section 5 concludes the paper.

    Deep learning-based approaches for co-speech gesture generation primarily rely on three input modalities: audio, text, and non-linguistic information [20]. We focus on audio-driven, diffusion-based gesture generation. Therefore, in this section, we discuss audio-driven gesture generation and diffusion-based motion synthesis models.

    Hasegawa et al. [21] proposed an audio-driven gesture generation method based on a bidirectional LSTM, incorporating temporal filtering to mitigate discontinuities in the generated pose sequences. Kucherenko et al. [22] transformed audio input into the 3D joint coordinates of gesture sequences while training a speech encoder to reduce the dimensionality of speech for motion representation. However, this approach overlooks the one-to-many relationship between audio and gestures: a person might make different gestures for the same sentence at different times. Ginosar et al. [11] converted 2D spectrograms into 1D signals and employed generative adversarial networks to predict gestures. Ao et al. [23] introduced a co-speech gesture synthesis method using rhythm-based segmentation and hierarchical embeddings to align speech and gestures, achieving superior rhythmic and semantic coherence. Ye et al. [24] proposed an end-to-end flow-based model without style labels, combining a global encoder and a gesture perceptual loss to generate natural gestures. Liu et al. [25] introduced BEAT, a large-scale motion capture dataset with semantic and emotional annotations, and proposed a cascaded network (CaMN) for multi-modal gesture synthesis. Yi et al. [26] proposed a novel approach for generating realistic 3D body motions, hand gestures, and facial expressions directly from speech, leveraging a new dataset and a speech-to-motion framework that models facial expressions and body-hand movements independently. Qian et al. [16] encoded Mel-spectrograms into template vectors to reduce uncertainty in synthesized poses. Our method conditions the generation of co-speech gestures on key features of both audio waveforms and Mel-spectrograms.

    Recently, diffusion models [13,27,28] have made remarkable strides in generative modeling tasks, owing to their forward noise-injection and reverse denoising formulation. While these approaches can demand substantial computational resources, the generated samples exhibit high quality and diversity. Zhang et al. [29] and Chen et al. [30] explored motion diffusion generation models conditioned on text. Alexanderson et al. [31] and Zhu et al. [32] were among the first to synthesize co-speech gestures using diffusion models. Yang et al. [33] proposed DiffuseStyleGesture, a diffusion-based model using attention mechanisms to generate high-quality, speech-matched, diverse, and stylized gestures. Yuan et al. [34] proposed a physics-based diffusion paradigm to guide motion generation. Ao et al. [35] proposed a neural network for stylized co-speech gesture synthesis using CLIP-guided multimodal prompts and a latent diffusion model, enabling flexible, realistic, and semantically aligned gesture generation. We also adopt a latent diffusion model to generate co-speech gestures, which alleviates the random jitter commonly observed in human motion synthesis. However, unlike Ao et al. [35], we focus more on the key features within the audio data, and in the gesture synthesis stage the cross-dimension attention mechanism of CDformer is used to minimize the impact of redundant information on the model.

    Our objective is to generate co-speech gestures that are more expressive and exhibit higher fidelity. Given an N-frame co-speech video, pose sequences $x_0 = [s_1, \dots, s_N]$ are extracted using pose estimators such as OpenPose [36] and ExPose [37]. To stabilize training in the presence of skeletons with different bone lengths, we follow the baseline methods [8,10,32] and represent each pose by unit direction vectors, $s_i = [d_{i,1}, \dots, d_{i,J-1}]$, where J is the total number of joints and $d_{i,j}$ is the unit direction vector between two adjacent skeletal key points of the j-th bone in frame i. The audio information matching the gesture is represented as $a = [a_1, \dots, a_N]$ and is combined with the time step t. Because an initial pose facilitates a smoother synthesis process, we use the last 4 frames of the previously synthesized gestures as the seed gestures $M = [m_1, \dots, m_4]$. We introduce a Motion Auto-Encoder (MAE) [8] that compresses both M and the noisy gesture $x_0$ into a lower-dimensional latent space, where the diffusion process produces the latent data $x_t$. Finally, reverse denoising is performed within our CDformer network to synthesize the gestures. The combination c of the above context information is the input to our Audio to Diffusion Gesture generation framework G. Figure 2 illustrates the detailed process.

    Figure 2.  The proposed Audio to Diffusion Gesture(A2DG) generation pipeline.

    The end goal p can be described as:

    p=G(c) (1)
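    To make the pose representation above concrete, the following sketch converts absolute 3D joint positions into the unit direction vectors d_{i,j}; the joint count and the kinematic tree in `parents` are illustrative assumptions rather than the authors' exact preprocessing.

```python
import torch

def to_direction_vectors(joints, parents):
    """Convert absolute 3D joint positions into unit direction vectors s_i.

    joints:  (N, J, 3) tensor of joint positions for N frames and J joints.
    parents: list of length J giving the parent index of each joint
             (a hypothetical kinematic tree, not the paper's exact skeleton).
    returns: (N, J-1, 3) tensor of unit vectors d_{i,j} along each bone.
    """
    dirs = []
    for j in range(1, joints.shape[1]):
        bone = joints[:, j] - joints[:, parents[j]]                   # vector along bone j
        dirs.append(bone / (bone.norm(dim=-1, keepdim=True) + 1e-8))  # normalize to unit length
    return torch.stack(dirs, dim=1)

# Example with random data: 34 frames, 10 upper-body joints.
poses = torch.randn(34, 10, 3)
parents = [0, 0, 1, 2, 3, 1, 5, 6, 1, 8]   # placeholder parent indices
s = to_direction_vectors(poses, parents)    # shape (34, 9, 3)
```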

    Most research [8,9,10,11] is based on Generative Adversarial Networks [12] applied to the audio-to-gesture task, which involves a complex mapping relation. Such training tends to be unstable and prone to mode collapse. To generate high-quality and diverse gestures, we design an audio-driven gesture generation pipeline based on the diffusion model [13,28]. The core idea is to train a probabilistic model that removes normally distributed noise step by step, defined as $p_\theta(x_0) := \int p_\theta(x_{0:T})\, \mathrm{d}x_{1:T}$, to approximate the real distribution $q(x_0)$ and then generate the target gesture, where $x_1, \dots, x_T$ are the latent variables.

    The diffusion model is divided into two parts: forward diffusion process and reverse denoising.

    Diffusion process: The forward diffusion process follows a Markov chain [27]; the model gradually adds Gaussian noise to the input data according to the variance schedule $\beta_t \in (0,1)$, until the input distribution approaches the standard normal distribution $\mathcal{N}(0, I)$:

    $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$ (2)
    $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right),$ (3)

    where the variance schedule $\beta_t \in (0,1)$ consists of hyper-parameters that follow a fixed schedule; in our experiments $\beta_t$ increases linearly (see the experimental details below).
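    A minimal sketch of this forward-diffusion bookkeeping follows; the linear schedule from 0.0001 to 0.02 over T = 500 steps matches the experimental settings reported later, and the variable names are ours.

```python
import torch

# Variance schedule and cumulative products used throughout Eqs (2)-(7).
T = 500
betas = torch.linspace(1e-4, 2e-2, T)        # variance schedule beta_t
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t = prod_s alpha_s

def q_step(x_prev, t):
    """One forward step q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise
```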

    Denoising process: The noisy gesture $x_t$ is obtained by corrupting the pose sequence $x_0$ with the injected noise during the diffusion process, and the denoising process also follows a Markov chain. Reversing the forward process, $p_\theta(x_{0:T})$ allows sampling the ground truth $x_0$ by starting from $p(x_T) = \mathcal{N}(x_T; 0, I)$, where each step is a learned Gaussian transition $(\mu_\theta, \Sigma_\theta)$:

    $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$ (4)
    $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right).$ (5)
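    The reverse chain of Eqs (4)-(5) can be sketched as an ancestral sampling loop; here $\Sigma_\theta$ is fixed to $\beta_t I$, and `eps_model` is a placeholder for a conditional noise predictor rather than the authors' exact interface.

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, context):
    """Ancestral sampling for Eqs (4)-(5).

    Reuses T, betas, alphas, alphas_bar from the schedule sketch above.
    eps_model(x_t, t, context) is assumed to predict the injected noise.
    """
    x = torch.randn(shape)                                   # start from p(x_T) = N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((shape[0],), t), context)
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])      # mu_theta(x_t, t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise              # Sigma_theta fixed to beta_t I
    return x
```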

    During training, we follow Ho et al. [9] and sample $x_t$ at an arbitrary time step directly from $q(x_t \mid x_0)$, which is more efficient and can be formulated as follows:

    $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$ (6)

    where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=0}^{t} \alpha_s$. The above unconditional diffusion model can already generate co-speech gestures of good quality, but additional conditions need to be injected to control the generated gestures. Therefore, it is necessary to construct a network that models $\epsilon_\theta(x_t, t, \text{audio})$. Thus, we can use Eq (6) to generate the noisy gesture $x_t$ directly. At each uniformly sampled time step t, the audio feature vectors are extracted by the audio feature constructor, which captures richer audio information. These conditions constitute the context information c, which is input into our proposed CDformer for sequence modeling. We use the Mean-Square-Error (MSE) loss to optimize the diffusion model parameters:

    $\mathcal{L}(\theta) = \mathbb{E}_{t \sim [1,T],\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0,I)}\!\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, \text{audio}) \right\|^2 \right].$ (7)
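    Putting Eqs (6) and (7) together, a training step of the conditional noise predictor can be sketched as follows; it reuses the schedule variables from the sketch above, and the function signature is an assumption.

```python
import torch

def diffusion_loss(eps_model, x0, context):
    """Noise-prediction MSE of Eq (7); reuses T and alphas_bar defined above.

    x0:      batch of (latent) gesture sequences, shape (B, ...).
    context: fused conditioning (audio features, seed poses, time embedding).
    """
    t = torch.randint(0, T, (x0.shape[0],))                    # t ~ Uniform{0, ..., T-1}
    eps = torch.randn_like(x0)                                 # target noise
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))       # broadcast \bar{alpha}_t
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps     # Eq (6)
    return torch.nn.functional.mse_loss(eps_model(x_t, t, context), eps)
```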

    Mel-spectrograms, converted from audio signals, contain rich time-frequency information and align with the human auditory system's perception of audio. Therefore, we propose the AFC (Audio Feature Constructor), which enhances the model's capability to perceive and utilize global audio information by extracting audio features from both 1D audio signals and 2D Mel-spectrograms. Specifically, we employ the audio encoder from Yoon et al. [8], where the raw audio waveform is processed by cascaded 1D-CNNs to generate audio feature vectors. Because using a vanilla 1D-CNN alone on raw audio waveforms limits the model's ability to learn the joint distribution between audio and gestures, we concurrently apply a two-stage attention operation [38] to the Mel-spectrogram. The input $M \in \mathbb{R}^{N \times C \times F \times T}$ is converted from the raw audio, where C is the number of channels (set to 1 for mono audio), F is the frequency dimension, and T is the time dimension.
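    For illustration, a mono waveform can be converted into such an (N, C, F, T) Mel-spectrogram input with torchaudio; the sample rate and frame settings below are assumptions, not the paper's exact values.

```python
import torch
import torchaudio

# Build a Mel-spectrogram transform (illustrative settings).
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=512, n_mels=64)

waveform = torch.randn(1, 16000 * 4)      # ~4 s of mono audio as a placeholder
M = to_mel(waveform).unsqueeze(0)          # shape (N=1, C=1, F=64, T=num_frames)
```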

    First stage: We introduce bilinear pooling [39] to capture global audio features in the Mel-spectrogram. Bilinear pooling performs summation pooling on all pairs of audio feature vectors (xi,yi) in the Mel-spectrogram to extract key audio features:

    $G_{\text{bilinear}}(X, Y) = X Y^{\top} = \sum_i x_i\, y_i^{\top},$ (8)

    where X and Y are audio feature maps from the same time domain. We use a softmax attention map to collect key audio features from different locations into a set of global descriptors.

    Second stage: As shown in Figure 3, for each time-frequency input position i=1,,FT, an attention vector is generated based on the local audio feature vi. This attention vector supplements the global audio feature information with key audio features from the global descriptors, resulting in the final audio feature vector zi.

    Figure 3.  Two-stage audio attention operation for the Mel-spectrogram.
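    A sketch of a double-attention-style implementation of this two-stage operation is given below: the first stage gathers global descriptors by attention-weighted bilinear pooling (Eq (8)), and the second stage distributes them back to every time-frequency position. The 1×1 convolutions, channel widths, and descriptor count are our own assumptions.

```python
import torch
import torch.nn as nn

class TwoStageAudioAttention(nn.Module):
    """Two-stage (gather/distribute) attention over Mel-spectrogram features."""

    def __init__(self, c_in, c_mid=64, n_desc=32):
        super().__init__()
        self.feat = nn.Conv2d(c_in, c_mid, 1)      # local features x_i
        self.gather = nn.Conv2d(c_in, n_desc, 1)   # attention maps for bilinear pooling
        self.distrib = nn.Conv2d(c_in, n_desc, 1)  # per-position attention vectors
        self.out = nn.Conv2d(c_mid, c_in, 1)

    def forward(self, m):                          # m: (N, C, F, T) Mel features
        n, _, f, t = m.shape
        x = self.feat(m).flatten(2)                       # (N, c_mid, F*T)
        a = self.gather(m).flatten(2).softmax(dim=-1)     # softmax over positions
        G = torch.bmm(x, a.transpose(1, 2))               # global descriptors, Eq (8)
        v = self.distrib(m).flatten(2).softmax(dim=1)     # softmax over descriptors
        z = torch.bmm(G, v).view(n, -1, f, t)             # redistributed features z_i
        return self.out(z)
```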

    In this section, we combine the initial gestures, time steps, and useful audio features to form the contextual information, which is concatenated with the noise gesture sequence along the feature channels to create condition tokens. These tokens are then fed into our proposed denoising model, CDformer (Cross Dimension Transformer). As illustrated in Figure 2, after linear projection, the input embedding dimension is adjusted to the hidden layer dimension:

    y=Wx+b (9)

    where W is the weight matrix, b is the bias vector, x is the input vector, and y is the output vector. The positional embedding parameters provide a unique embedding vector for each gesture, enabling the model to capture the positional information of gestures. We apply the Vision Transformer [40] encoder-decoder network to denoise the noisy gestures. The conditional tokens pass through hierarchical transformer blocks, where the multi-head mechanism splits the input tokens into multiple parts and processes them using the self-attention mechanism:

    $\text{Attention}(Q, K, V) = \sigma\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$ (10)

    where σ is the softmax operator, Q is the query feature vector, K is the key feature vector, V is the value feature vector, and d is the channel dimension of the gesture features.
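    For reference, Eq (10) written out directly:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Eq (10): softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # attention logits
    return scores.softmax(dim=-1) @ v              # weighted sum of values
```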

    When concatenating audio information along the feature channels, redundant information may arise due to the overlap between the two audio modalities. Therefore, we introduce cross-dimension attention [41] to mitigate the impact of this redundancy on the model. By focusing on both spatial and channel information, this mechanism enables the model to autonomously select the most important features for learning. This operation is applied after the layer normalization (layer norm) in the encoder-decoder architecture:

    $\text{CD-Attn}(\text{token}) = \frac{1}{3}\left( V_a \oplus V_m \oplus P_t \right),$ (11)

    where $V_a$ represents features extracted from the raw audio, $V_m$ represents features extracted from the Mel-spectrogram, $P_t$ denotes the time embedding features, and $\oplus$ denotes the concatenation operation.
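    Read literally, Eq (11) fuses the three streams with weight 1/3; a minimal sketch of that fusion, assuming the streams have already been projected to a common shape, is:

```python
def cd_attn(v_a, v_m, p_t):
    """Average the raw-audio, Mel, and time-embedding streams with weight 1/3.

    Only the 1/3-weighted fusion of Eq (11) is illustrated here; the full
    cross-dimension attention [41] additionally attends over the spatial and
    channel dimensions of each stream before fusing them.
    """
    return (v_a + v_m + p_t) / 3.0   # assumes matching tensor shapes
```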

    We refrained from utilizing co-speech gestures collected in a studio environment, as requiring speakers to produce gestures that perfectly align with their speech often results in exaggerated and insincere expressions [10]. Such an approach contradicts our research objective, which is to acquire naturally fluent and rhythmical co-speech gestures.

    TED Gesture. TED Gesture is a large-scale dataset for co-speech gesture generation, featuring 1776 TED talk videos with various narrators and topics. The dataset includes the 3D poses of the speakers' upper bodies and the corresponding audio sequences. Following the data processing approach of previous works [8,10,32], we resampled human poses at 15 FPS (approximately 4 seconds per sample). Each video sequence is 34 frames long with a step length of 10 frames. The upper-body posture includes 10 key points, resulting in a total of 252,109 training samples. These samples are divided into training, validation, and test sets in an 80%/10%/10% split.
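    The windowing just described amounts to a simple sliding-window split; a sketch (not the released preprocessing code):

```python
def make_samples(poses, win=34, stride=10):
    """Cut a 15 FPS pose sequence into 34-frame samples with a stride of 10.

    poses: sequence (e.g., list or array) of per-frame poses.
    """
    return [poses[s:s + win] for s in range(0, len(poses) - win + 1, stride)]
```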

    TED-Expressive. High-quality finger motion data is essential for generating expressive and meaningful gestures [20], yet it is rare in existing datasets. Building on TED Gesture, TED-Expressive [10] annotates 43 key points on the speaker's upper body, including 13 upper body joints and 30 finger joints, using the 3D pose estimator ExPose [37]. The other settings are consistent with TED Gesture.

    To objectively evaluate the proposed pipeline, we use three common objective evaluation indicators to measure the quality of co-speech gesture generation.

    Fréchet Gesture Distance (FGD). FGD measures the distance between the distributions of synthesized gestures and ground truth in a latent feature space. The closer the distributions, the more similar the synthetic gestures are to the real ones; it is analogous to the Fréchet Inception Distance (FID) [42] used in image generation studies and is the main index for evaluating the plausibility of generated gestures. Yoon et al. [8] trained a feature extractor on the Human3.6M [43] dataset for computing the latent features of the real gestures X and of the synthetic gestures $\hat{X}$:

    $\text{FGD}(X, \hat{X}) = \left\| \mu_r - \mu_g \right\|^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right),$ (12)

    where $\mu_r$ and $\Sigma_r$ are the first and second moments of the latent feature distribution $Z_r$ of the real human gestures X, and $\mu_g$ and $\Sigma_g$ are the first and second moments of the latent feature distribution $Z_g$ of the generated gestures $\hat{X}$. Note that, for a fair comparison on the TED Gesture and TED-Expressive datasets, we did not train our own feature extractor but used the ones provided by Yoon et al. [8] and Liu et al. [10].
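    For concreteness, Eq (12) can be computed from the extracted latent features exactly as in standard FID code; a sketch follows (the pretrained feature extractors themselves are not reproduced here).

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(feat_real, feat_gen):
    """Eq (12) on latent features of shape (num_samples, dim)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # (Sigma_r Sigma_g)^{1/2}
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                  # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```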

    Beat Consistency Score (BC). To measure the correlation between the synthesized gesture sequence and the audio, Li et al. [44] proposed the Beat Consistency Score (BC). Because of the differences in the kinematic velocities of human joints, it is necessary to calculate the mean absolute angle change (MAAC) of the joint angle $\theta_j$ between adjacent frames:

    $\text{MAAC}(\theta_j) = \frac{1}{S(T-1)} \sum_{s=1}^{S} \sum_{t=1}^{T-1} \left\| \theta_{j,s,t+1} - \theta_{j,s,t} \right\|_1,$ (13)

    where S represents the total number of clips in the dataset and T represents the number of frames in each clip.
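    A direct transcription of Eq (13) for a single joint:

```python
import numpy as np

def maac(angles):
    """Eq (13): mean absolute angle change of one joint.

    angles: array of shape (S, T), the joint angle theta_j over S clips of T frames.
    """
    S, T = angles.shape
    return np.abs(np.diff(angles, axis=1)).sum() / (S * (T - 1))
```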

    We follow Li et al. [44] and compute the kinematic beats as the local minima of the kinematic velocity. BC then computes the average distance between every audio beat and its nearest kinematic beat with the following equation:

    $\text{BC} = \frac{1}{n} \sum_{i=1}^{n} \exp\!\left( -\frac{\min_{b^x_j \in B^x} \left\| b^x_j - b^y_i \right\|^2}{2\sigma^2} \right),$ (14)

    where $B^x = \{b^x_j\}$ is the set of kinematic beats, $B^y = \{b^y_i\}$ is the set of audio beats, and σ is a parameter used to normalize the sequences; we set σ = 0.1 empirically.
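    A sketch of Eq (14) operating on beat time stamps:

```python
import numpy as np

def beat_consistency(kinematic_beats, audio_beats, sigma=0.1):
    """Eq (14): Gaussian-kernel score of each audio beat's distance to its
    nearest kinematic beat, averaged over all audio beats (times in seconds)."""
    kin = np.asarray(kinematic_beats, dtype=float)
    scores = [np.exp(-np.min((kin - b) ** 2) / (2.0 * sigma ** 2))
              for b in audio_beats]
    return float(np.mean(scores))
```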

    Diversity Score. Diversity assesses the degree of variation among the motions generated for the inputs [45]. We use a pre-trained auto-encoder to capture the latent features of the synthesized gestures, randomly select 500 synthesized gestures, and calculate the average absolute error between each latent feature and another randomly selected one.
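    A sketch of this computation; the exact pairing of the 500 sampled gestures is our assumption based on the description above.

```python
import numpy as np

def diversity_score(features, n_samples=500, seed=0):
    """Average absolute difference between latent features of random pairs.

    features: array of shape (num_gestures, dim) from the pretrained auto-encoder.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(features), n_samples)
    j = rng.integers(0, len(features), n_samples)
    return float(np.abs(features[i] - features[j]).mean())
```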

    Experimental environment. The software and hardware environment used in this experiment is shown in Table 1.

    Table 1.  Experimental environment configuration.
    Hardware/Software | Configuration
    Operating System | Ubuntu 18.04 LTS
    Deep Learning Framework | PyTorch 1.13.0
    Programming Language | Python 3.7
    CUDA Version | 11.7
    Processor | Intel Core i5-13600K
    GPU | NVIDIA GeForce RTX 4090


    Baselines. We select six state-of-the-art models from recent years to compare with the proposed A2DG. 1) Seq2Seq [17] follows an encoder-decoder structure to generate co-speech gestures from speech text; 2) Speech2Gesture [11] converts the 2D spectrogram of an audio signal into a 1D signal as input to generate co-speech gestures; 3) Joint Embedding [18] maps text and motion into the same embedding space and is a representative work on generating motion from text; 4) Trimodal [8] uses audio, text, and speaker identity as context input and introduces an adversarial training scheme to generate co-speech gestures; 5) HA2G [10] proposes a hierarchical audio-gesture generator across multiple levels of semantic granularity; and 6) DiffGesture [32] generates co-speech gestures using a diffusion gesture stabilizer and an annealed noise sampling strategy. All methods were trained and evaluated on the TED Gesture and TED-Expressive datasets.

    Experimental details. For a fair comparison, we followed previous work [8,10,32], setting N = 34 and M = 4, where N indicates that each sample is split into 34-frame sequences and M indicates that the first four frames are used as seed poses. The subdivision stride is S = 10, with J = 10 upper-body joints for training on the TED Gesture dataset and J = 43 upper-body joints (including finger joints) for training on the TED-Expressive dataset; the position of each joint is represented by a normalized unit direction vector. For the diffusion model, we apply T = 500 denoising steps, and the variance schedule increases linearly from 0.0001 to 0.02. The Cross Dimension Transformer consists of a 4-layer transformer encoder with self-attention and a feed-forward network, and a 4-layer transformer decoder with a similar structure. The hidden dimension of the transformer blocks is set to 256 for TED Gesture and 512 for TED-Expressive. We used the Adam optimizer with β1 = 0.5 and β2 = 0.999, and the learning rate is set to 0.0005 and 0.0002 for TED Gesture and TED-Expressive, respectively. The model was trained on a single NVIDIA GeForce RTX 4090 GPU with batch size 128; training takes about 9 hours for TED Gesture and about 14 hours for TED-Expressive.
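    As a small illustration of these settings, a sketch of the optimizer configuration is shown below; the model object is only a stand-in for the CDformer.

```python
import torch

# Reported optimization settings for TED Gesture: Adam with beta1 = 0.5,
# beta2 = 0.999 and learning rate 5e-4 (2e-4 is used for TED-Expressive).
model = torch.nn.Linear(256, 256)   # placeholder for the CDformer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.5, 0.999))
```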

    Objective results and comparison. The quantitative results are shown in Table 2. We compare our approach with prior works on two public co-speech gesture datasets. As can be seen, our method achieves state-of-the-art performance on both the FGD and Diversity metrics, surpassing the previous best approaches. FGD improves by 16.8% on TED Gesture and by a significant 56% on TED Expressive, while Diversity increases by 1.406 and 0.376, respectively, indicating that our method can generate diverse and high-fidelity gestures. The BC score we obtain on TED Gesture is lower than that of DiffGesture. It is noteworthy that BC and Diversity are meaningful only when the synthesized motions are smooth and natural: synthesized gestures may exhibit irregular random jitter, causing BC and Diversity scores to surpass the ground truth. For instance, in Table 2, the BC score for the TED Expressive ground truth is 0.703 and the Diversity score is 178.827, while DiffGesture scores higher on both metrics with 0.718 and 182.757, respectively. This could be related to the abundance of human joint points in the TED Expressive dataset. Moreover, the Fréchet Gesture Distance exhibits a high degree of statistical correlation with human similarity ratings from large-scale user studies [40]. Hence, apart from the Fréchet Gesture Distance, the other quantitative metrics should be used only as references, because they do not always align with human perception of visual quality [31,46].

    Table 2.  The quantitative results comparison for TED Gesture and TED Expressive. ↓ denotes the lower the better, and ↑ denotes the higher the better. The best results are in bold.
    Methods | TED Gesture [8,17] |  |  | TED Expressive [10] |  |
     | FGD ↓ | BC ↑ | Diversity ↑ | FGD ↓ | BC ↑ | Diversity ↑
    Ground truth | 0 | 0.698 | 108.525 | 0 | 0.703 | 178.827
    Attention Seq2Seq [17] | 18.154 | 0.196 | 82.776 | 54.920 | 0.152 | 122.693
    Speech2Gesture [11] | 19.254 | 0.668 | 93.802 | 54.650 | 0.679 | 142.489
    Joint Embedding [18] | 22.083 | 0.200 | 90.138 | 64.555 | 0.130 | 120.627
    Trimodal [8] | 3.729 | 0.667 | 101.247 | 12.613 | 0.563 | 154.088
    HA2G [10] | 3.072 | 0.672 | 104.322 | 5.306 | 0.641 | 173.899
    DiffGesture [32] | 1.506 | 0.699 | 106.722 | 2.600 | 0.718 | 182.757
    A2DG (Ours) | 1.253 | 0.678 | 108.128 | 1.126 | 0.718 | 183.133


    Subjective results and comparison. The gesture visualization results are illustrated in Figure 4. We contrast our method with DiffGesture [32], which performed best among the baselines. The ground truth exhibits little variation, lacking diverse or rhythmic gestures. While DiffGesture initially displays good continuity in the synthesized gestures, it later transitions into monotonous gestures, with only a slight swing of the right arm (marked by a red oval) and a consistently drooping left arm showing irregular shaking (marked by an orange oval). This aligns with our quantitative analysis. Our synthesized gestures avoid rigid movement patterns, with relatively smooth transitions between them. Moreover, when descriptive phrases like "something negative happens" are present, our gestures demonstrate a level of semantic relevance.

    Figure 4.  The visualization subjective results of synthesized gesture sequence.

    User Study. The performance of a generative model cannot be accurately measured by objective metrics alone [18]. For instance, in the same contextual scenario, as shown in Figure 4, DiffGesture presents a segment of dull gestures in the later stages, whereas A2DG (ours) generates more expressive gestures. Hence, it is imperative to combine human judgment with objective metrics when assessing gesture generation models. We conducted a user study comparing our proposed pipeline with several baselines [8,10,11,18,32]; the assessment covers three aspects of the gestures: Naturalness, Smoothness, and Synchrony. Specifically, we generated gesture sequences from the TED Gesture and TED-Expressive test sets and randomly selected 10 clips of approximately 20 seconds each. We asked 10 participants to rate the clips after watching them twice. Scores range from 1 to 5, with higher scores indicating greater participant endorsement of the gestures synthesized by the model. As shown in Figure 5(a), our approach performs well on all three metrics; for TED-Expressive, the generated gestures approach ground-truth levels in terms of Naturalness, Smoothness, and Synchrony. This may be attributed to the dataset's richness in finger-joint information, resulting in higher-quality synthesized gestures. For TED Gesture, with only 10 upper-body joints, gesture synthesis quality is comparatively lower, as depicted in Figure 5(b). Our approach outperforms the other baselines in Naturalness but falls short of L2P in Smoothness and trails Trimodal in Synchrony.

    Figure 5.  The statistical results of our user study on TED-Expressive dataset and TED-Gesture dataset. On a scale of 1-5, the higher the better.

    Quantitative ablation study. To validate the effectiveness of the proposed components, we conduct ablation studies on the TED Gesture and TED Expressive datasets; the quantitative results are shown in Table 3. Removing AFC leads to declines of varying degrees across all three evaluation metrics on both datasets, demonstrating that extracting useful audio information through AFC noticeably improves the model's ability to learn the joint distribution of audio and gestures. After removing CDformer, all metrics degrade except the Diversity score on TED Expressive, which improves slightly; this indicates that the robust sequence modeling capability of CDformer contributes to the quality of the generated gestures. Furthermore, given that the FGD metric currently aligns best with human perception among the objective evaluation measures, the significant degradation in FGD after removing our proposed components indicates the effectiveness of our approach in synthesizing high-quality co-speech gestures.

    Table 3.  The results of quantitative ablation study regarding the proposed modules.
    Methods | TED Gesture [8,17] |  |  | TED Expressive [10] |  |
     | FGD ↓ | BC ↑ | Diversity ↑ | FGD ↓ | BC ↑ | Diversity ↑
    Ground truth | 0 | 0.698 | 108.525 | 0 | 0.703 | 178.827
    w/o AFC | 1.803 | 0.662 | 105.389 | 1.339 | 0.713 | 177.642
    w/o CDformer | 1.325 | 0.661 | 107.220 | 1.453 | 0.717 | 180.760
    A2DG (Ours) | 1.253 | 0.678 | 108.128 | 1.126 | 0.718 | 183.133


    Qualitative ablation study. We conduct a qualitative ablation study on the proposed modules; the results are shown in Figure 6. Without our Audio Feature Constructor, simply injecting raw audio information into the network degrades the quality of the synthesized gestures; we highlight the resulting unnatural gestures within the red box. Our complete pipeline synthesizes diverse and meaningful gestures. For instance, when saying "now depend", our synthesized gesture extends the arm to emphasize the stressed word.

    Figure 6.  The visualization qualitative ablation results of synthesized gesture sequence.

    In this paper, we propose a diffusion model-based audio-driven co-speech gesture generation framework comprising two modules: AFC and CDformer. The AFC module extracts useful audio feature information from raw audio waveforms and Mel-spectrograms, enhancing the model's ability to learn the joint distribution between audio and gestures. The cross-dimension attention in the CDformer module focuses on spatial and channel information, thereby reducing the impact of redundant information on the model. Leveraging the powerful sequence modeling capabilities of the Transformer, our method can generate diverse and realistic gestures.

    Our research is limited to upper-body movements, and during the synthesis phase, speaker identity was not incorporated to generate gestures with a personal style. Therefore, in future work, we plan to explore additional audio cues, such as analyzing prosodic features and extracting speaker-specific characteristics from Mel-spectrograms, to generate stylized full-body co-speech gestures.

    The authors declare they have not used artificial intelligence (AI) tools in the creation of this article.

    The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is supported by the Fujian Province Industrial Guidance (Key) Project (Grant No. 2022H0053), the Sanming Major Science and Technology Project of Industry-University-Research Collaborative Innovation (Grant No. 2022-G-4) and the Start-up Research Project of Fujian University of Technology (Grant Nos. GY-Z21064, GY-Z21065).

    The authors declare there are no conflicts of interest.



  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)