
Deep learning techniques have proven successful in various computer vision fields, such as image classification [1], object detection [2], and semantic segmentation [3]. However, the effectiveness of deep learning relies heavily on large, labeled training datasets, which could be labor-intensive to annotate. When dealing with large unlabeled datasets, it is often impractical to label enough data for training a deep learning model. An alternative approach is transfer learning, where labeled data from related domains (source domain) are utilized to enhance the model's performance in the domain of interest (target domain). Transfer learning is the process of applying knowledge learned from a labeled source domain to a target domain, where labeled data may be limited or unavailable.
Pan [4] classified transfer learning into three categories according to the availability of labeled data in the two domains during training: (1) inductive transfer learning, where the target domain data is labeled, irrespective of whether the source domain data is labeled; (2) transductive transfer learning, where only the source domain data is labeled and the target domain data remains unlabeled; and (3) unsupervised transfer learning, where both domains lack labels. Transductive transfer learning can be further divided into two types: (1) domain adaptation, where both domains share the same attributes but have different marginal probability distributions; and (2) sample selection bias, where the sample spaces or data types of the two domains differ, such as images in the source domain and text in the target domain. This paper focuses on unsupervised domain adaptation (UDA), which aims to minimize the distribution discrepancy between data from the two domains, enabling successful knowledge transfer from the source domain to the target domain.
Currently, numerous domain adaptation methods have been researched and developed [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. These methods fall into three main categories [4]: Instance reweighting methods [5,6,7], feature extraction methods [8,9,10,11], and classifier adaptation methods [12,13]. Feature extraction methods aim to learn domain-invariant feature representations and are broadly categorized into two types [14]: Adversarial learning-based approaches [15,16,17] and statistics-based approaches [18,19,20]. Adversarial learning-based methods seek domain-invariant features by generating images or feature representations across domains. For instance, the deep reconstruction classification network (DRCN) [21] builds a classifier for the labeled source domain data and constructs a domain-invariant feature representation shared with the unlabeled target domain data. Statistics-based methods define a suitable measure of difference or distance between two distinct distributions [18,24,25,26,27,28,29]. Various distance metrics, such as the quadratic [30], Kullback-Leibler [31], and Mahalanobis [32] distances, have been proposed over the years. However, these metrics are not easily adaptable to different domain adaptation (DA) models and may not effectively describe complex distributions, such as conditional and joint distributions, due to theoretical limitations. In recent years, the MMD [28], initially used for two-sample testing, has been found to be effective in measuring the distance between the sample distributions of two domains in feature space, and it enables alignment of the distributions by minimizing the MMD between them. The method presented in this paper falls into this category. Long et al. [9] introduced MMD regularization, utilizing it to reduce the discrepancy between the feature distributions of the two domains in the hidden layers of deep adaptation networks.
The use of MMD focuses mainly on aligning the overall distributions of the two domains, but often falls short in ensuring precise alignment of data within the same category across domains. In response, Long et al. [19] proposed the class-wise maximum mean discrepancy (CWMMD) for more robust domain adaptation: The samples from the two domains are linearly mapped into a common feature space, the MMD for each category is calculated, and the per-category terms are summed to obtain the CWMMD. Wang et al. [33] highlighted that minimizing the MMD is equivalent to minimizing the overall data variance while simultaneously maximizing the intra-class distances of the source and target domains, which decreases feature discriminativeness. They introduced balance parameters to mitigate this issue, but their approach was restricted to linear transformations of the feature space and used the L2 norm as the MMD estimator in the linearly transformed feature space. It is worth noting that the L2 norm is not well suited for general estimation [34,35], and that linear transformations may not adequately capture complex data relationships, especially when nonlinear mappings are required.
In contrast, deep neural networks, particularly convolutional neural networks (CNNs), learn powerful and expressive nonlinear transformations. This paper proposes a method that exploits this capability by training a CNN architecture so that the model automatically learns feature representations well suited to the task at hand. Furthermore, the loss function used in the domain adaptation process can be efficiently evaluated in a reproducing kernel Hilbert space (RKHS). This facilitates effective alignment of data belonging to the same class from both the source and target domains in the shared feature space.
This section presents the related research, including pseudo labels and different variants of MMD.
Computing a class-level MMD during training requires pseudo labels for the unlabeled target domain data. A simple way to generate pseudo labels is to apply formula (1) directly to the source domain model [14]; that is, input the target sample xt into the source domain model f = C ∘ F, which comprises a feature extractor F and a classifier C, to obtain the softmax classification output δ = (δ1, δ2, …, δC), and then take the index of the largest component of δ as the pseudo label for xt. However, due to domain shift, pseudo labels generated this way may be heavily biased. Instead, this study adopts the self-supervised pseudo-label strategy proposed by Liang et al. [36]. The strategy first uses the current target domain model to compute the centroid of each category of the target domain data, similarly to weighted K-means clustering, as shown in formula (2). These centroids robustly represent the distributions of the different classes in the target domain data. Next, each target domain sample is assigned the category of its nearest centroid as its pseudo label, as shown in formula (3), where Dcos(a, b) denotes the cosine distance between a and b. The new pseudo labels are then used to recalculate the centroids, as shown in formula (4), and to update the pseudo labels again, as shown in formula (5). Finally, with the new pseudo labels, the target model is self-supervised using the cross-entropy loss function, as shown in (6).
$$\hat{y}_t = \arg\max_{1 \le k \le C} \delta_k(f(x_t)), \tag{1}$$

$$C_k^{(0)} = \frac{\sum_{x_t \in X_t} \delta_k(f(x_t))\, F(x_t)}{\sum_{x_t \in X_t} \delta_k(f(x_t))}, \qquad k = 1, 2, \ldots, C, \tag{2}$$

$$\hat{y}_t = \arg\min_{1 \le k \le C} D_{\cos}\!\left(F(x_t),\, C_k^{(0)}\right), \tag{3}$$

$$C_k^{(1)} = \frac{\sum_{x_t \in X_t} \mathbf{1}(\hat{y}_t = k)\, F(x_t)}{\sum_{x_t \in X_t} \mathbf{1}(\hat{y}_t = k)}, \tag{4}$$

$$\hat{y}_t = \arg\min_{1 \le k \le C} D_{\cos}\!\left(F(x_t),\, C_k^{(1)}\right), \tag{5}$$

$$L^{ssl}_{T}(f_t; X_t, \hat{Y}_t) = -\,\mathbb{E}_{(x_t, \hat{y}_t) \in X_t \times \hat{Y}_t}\, \sum_{k=1}^{C} \mathbf{1}[k = \hat{y}_t]\, \log \delta_k(f(x_t)). \tag{6}$$
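As an illustration of the centroid-based strategy in formulas (1)–(5), the following is a minimal PyTorch sketch; the function name, the two-round update, and the assumption that `feats` holds F(x_t) and `probs` holds δ(f(x_t)) for a batch of target samples are illustrative choices, not taken from the original paper.

```python
import torch
import torch.nn.functional as F_nn

def pseudo_labels(feats, probs, num_iters=2):
    """Centroid-based pseudo labels in the spirit of formulas (1)-(5).

    feats : (N, d) target-domain features F(x_t)
    probs : (N, C) softmax outputs delta(f(x_t)) of the current model
    """
    # initial soft centroids, formula (2): probability-weighted feature means
    weights = probs.t()                                           # (C, N)
    centroids = weights @ feats / weights.sum(dim=1, keepdim=True).clamp_min(1e-8)

    labels = probs.argmax(dim=1)                                  # formula (1), as a fallback
    for _ in range(num_iters):
        # nearest centroid in cosine distance, formulas (3) and (5)
        f_norm = F_nn.normalize(feats, dim=1)                     # (N, d)
        c_norm = F_nn.normalize(centroids, dim=1)                 # (C, d)
        labels = (f_norm @ c_norm.t()).argmax(dim=1)              # max similarity = min cosine distance

        # recompute hard centroids from the current pseudo labels, formula (4)
        one_hot = F_nn.one_hot(labels, probs.size(1)).float().t() # (C, N)
        centroids = one_hot @ feats / one_hot.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return labels
```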
The MMD is a distance measure between feature means. Gretton et al. [28] introduced an MMD measure that embeds distribution metrics in an RKHS and uses them to conduct a two-sample test for detecting differences between two unknown distributions, p and q. The test draws two sets of samples X and Y from these distributions and determines whether p and q are different distributions. They applied the kernel-based MMD to two-sample tests on various problems and achieved excellent performance. Furthermore, kernel-based MMDs have been shown to be consistent, asymptotically normal, and robust to model misspecification, and have been successfully applied to various problems, including transfer learning [29], kernel Bayesian inference [37], approximate Bayesian computation [38], two-sample testing [28], goodness-of-fit testing [39], generative moment matching networks (GMMN) [40], and autoencoders [41].
Gaussian kernel-based MMDs are commonly used estimators with the key property of universality, which allows the estimator to converge to the best approximation of the (unknown) data-generating distribution within the model, without making any assumptions about that distribution. In contrast, the L2 norm lacks this property, suffers from the curse of dimensionality, and is not suitable for universal estimation [34,35]. Furthermore, Gaussian kernel-based MMDs also serve as an effective measure of domain difference in UDA scenarios, and their computation is streamlined by applying a kernel function directly to the samples.
The squared MMD in an RKHS, denoted as $\|\mu_p - \mu_q\|^2_H$, can be straightforwardly expressed using kernel functions, and an unbiased estimate for finite samples is easily derived. Let x and x' be independent random variables drawn from distribution p, and y and y' independent random variables drawn from distribution q, and let H be a universal RKHS with unit ball F. The squared MMD is then given in (7) [32], where $\phi(\cdot)$ maps the samples into H, $\mu_p = \mathbb{E}_{x\sim p}[\phi(x)]$ and $\mu_q = \mathbb{E}_{y\sim q}[\phi(y)]$ are the kernel mean embeddings, and k is set to the commonly used Gaussian kernel, as shown in (8). An unbiased empirical estimate is given in (9), where $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ are two sets randomly sampled from the probability distributions p and q, respectively. However, there is no definitive method for selecting the bandwidth $\sigma$ of the kernel k in (9). Gretton et al. [28] suggested using the median distance between samples as the bandwidth, but did not verify that this choice is optimal.
$$\begin{aligned}
(\mathrm{MMD}(F, p, q))^2 &= \|\mu_p - \mu_q\|^2_H = \langle \mu_p - \mu_q,\, \mu_p - \mu_q\rangle_H = \langle \mu_p, \mu_p\rangle_H + \langle \mu_q, \mu_q\rangle_H - 2\langle \mu_p, \mu_q\rangle_H \\
&= \mathbb{E}_{x,x'\sim p}\big[\langle \phi(x), \phi(x')\rangle_H\big] + \mathbb{E}_{y,y'\sim q}\big[\langle \phi(y), \phi(y')\rangle_H\big] - 2\,\mathbb{E}_{x\sim p,\, y\sim q}\big[\langle \phi(x), \phi(y)\rangle_H\big] \\
&= \mathbb{E}_{x,x'\sim p}\big[k(x, x')\big] + \mathbb{E}_{y,y'\sim q}\big[k(y, y')\big] - 2\,\mathbb{E}_{x\sim p,\, y\sim q}\big[k(x, y)\big],
\end{aligned} \tag{7}$$

$$k(a, b) = \exp\!\left(-\frac{\|a - b\|_2^2}{2\sigma_\phi^2}\right), \tag{8}$$

$$(\mathrm{MMD}_u(F, X, Y))^2 = \frac{1}{m(m-1)}\sum_{i \ne j}^{m} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i \ne j}^{n} k(y_i, y_j) - \frac{2}{mn}\sum_{i,j}^{m,n} k(x_i, y_j). \tag{9}$$
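For concreteness, a small PyTorch sketch of the Gaussian kernel (8) and the unbiased estimator (9) might look as follows; the median-heuristic bandwidth fallback follows the suggestion of Gretton et al. [28], and the function names are illustrative.

```python
import torch

def gaussian_kernel(a, b, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), formula (8), for all sample pairs."""
    d2 = torch.cdist(a, b, p=2).pow(2)            # (m, n) squared Euclidean distances
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=None):
    """Unbiased squared MMD between samples X (m, d) and Y (n, d), formula (9)."""
    m, n = X.size(0), Y.size(0)
    if sigma is None:
        # median heuristic for the bandwidth, as suggested in [28]
        sigma = torch.cdist(torch.cat([X, Y]), torch.cat([X, Y])).median().clamp_min(1e-6)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # drop the diagonal terms to realize the unbiased i != j sums
    term_xx = (Kxx.sum() - Kxx.diag().sum()) / (m * (m - 1))
    term_yy = (Kyy.sum() - Kyy.diag().sum()) / (n * (n - 1))
    term_xy = 2.0 * Kxy.mean()
    return term_xx + term_yy - term_xy
```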
Although the MMD has been widely used in cross-domain problems, minimizing the MMD between the samples from two domains only narrows their marginal distributions. Long et al. [8] proposed joint distribution adaptation (JDA), which jointly adapts the marginal and conditional distributions in a reduced-dimensional principal component space and constructs new feature representations. They adopted a principal component analysis (PCA) transformation and minimized the Euclidean distance between the sample means of the two domains in the reduced-dimensional principal component space. They referred to this measure as the MMD for the marginal distribution, as shown in formula (10), where $X_s \in \mathbb{R}^{d\times n_s}$ and $X_t \in \mathbb{R}^{d\times n_t}$ are the samples from the source and target domains, respectively. Here, $n_s$ and $n_t$ are the numbers of samples from the source and target domains, d is the sample dimension, and $\phi$ represents a linear transformation function. Let A be the $d\times K$ standard matrix of the linear transformation $\phi$. Formula (10) can then be rewritten as formula (11), where $X_{st} = [X_s\,|\,X_t]$ and $M_0 \in \mathbb{R}^{n_{st}\times n_{st}}$ is calculated as shown in formula (12).
$$\mathrm{MMD}^2 = \left\| \frac{1}{n_s}\sum_{x_i \in X_s}\phi(x_i) - \frac{1}{n_t}\sum_{x_j \in X_t}\phi(x_j) \right\|_2^2, \tag{10}$$

$$\mathrm{MMD}^2 = \left\| \frac{1}{n_s}\sum_{x_i \in X_s} A^T x_i - \frac{1}{n_t}\sum_{x_j \in X_t} A^T x_j \right\|_2^2 = \mathrm{tr}\!\left(A^T X_{st} M_0 X_{st}^T A\right), \tag{11}$$

$$(M_0)_{ij} = \begin{cases} \dfrac{1}{n_s n_s}, & x_i, x_j \in X_s \\[4pt] \dfrac{1}{n_t n_t}, & x_i, x_j \in X_t \\[4pt] -\dfrac{1}{n_s n_t}, & \text{otherwise.} \end{cases} \tag{12}$$
In addition to the MMD for the marginal distribution, they also proposed an MMD for the conditional distribution. However, its empirical estimation requires samples for each category and, hence, labels for the target domain samples, which are unavailable. To address this, they suggested using pseudo labels for the target samples, obtained either from the current classifier trained on the source domain samples or through other methods. They named the MMD for the conditional distribution the class-wise MMD (CWMMD), as shown in Eq (13). Here, $X^c_s$ and $X^c_t$ represent the data samples of the c-th category from the source and target domains, respectively, $n^c_s$ and $n^c_t$ are the respective sample sizes, and the calculation of $M_c \in \mathbb{R}^{n_{st}\times n_{st}}$ is detailed in (14).
$$\mathrm{CWMMD}^2 = \sum_{c=1}^{C}\left\| \frac{1}{n^c_s}\sum_{x_i \in X^c_s} A^T x_i - \frac{1}{n^c_t}\sum_{x_j \in X^c_t} A^T x_j \right\|_2^2 = \sum_{c=1}^{C}\mathrm{tr}\!\left(A^T X_{st} M_c X_{st}^T A\right), \tag{13}$$

$$(M_c)_{ij} = \begin{cases} \dfrac{1}{n^c_s n^c_s}, & x_i, x_j \in X^c_s \\[4pt] \dfrac{1}{n^c_t n^c_t}, & x_i, x_j \in X^c_t \\[4pt] -\dfrac{1}{n^c_s n^c_t}, & x_i \in X^c_s,\, x_j \in X^c_t \ \text{or}\ x_j \in X^c_s,\, x_i \in X^c_t \\[4pt] 0, & \text{otherwise.} \end{cases} \tag{14}$$
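A NumPy sketch of how the coefficient matrices in (12) and (14) can be assembled for samples ordered as [source | target] is given below; the helper names and the use of pseudo labels for the target samples are assumptions for illustration.

```python
import numpy as np

def build_M0(ns, nt):
    """Coefficient matrix M0 of formula (12), samples ordered as [source | target]."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)          # 1/ns^2 and 1/nt^2 on the blocks, -1/(ns*nt) across

def build_Mc(ys, yt_pseudo, c):
    """Class-wise coefficient matrix Mc of formula (14) for class c."""
    ys, yt_pseudo = np.asarray(ys), np.asarray(yt_pseudo)
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros(ns + nt)
    src_c = np.where(ys == c)[0]               # class-c source sample indices
    tgt_c = ns + np.where(yt_pseudo == c)[0]   # class-c target sample indices (offset by ns)
    if src_c.size:
        e[src_c] = 1.0 / src_c.size
    if tgt_c.size:
        e[tgt_c] = -1.0 / tgt_c.size
    return np.outer(e, e)

# CWMMD^2 of formula (13) under a linear map A (columns of X are samples):
# cwmmd2 = sum(np.trace(A.T @ X @ build_Mc(ys, yt, c) @ X.T @ A) for c in range(C))
```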
In JDA, the marginal distribution discrepancy and the conditional distribution discrepancy across domains are minimized simultaneously. Consequently, the optimization problem for JDA combines Eqs (11) and (13), as shown in Eq (15). In (15), the constraint $A^T X_{st} H_{st} X_{st}^T A = I_{K\times K}$ fixes the overall data variation, ensuring that the data information in the subspace is statistically retained to some extent. $\|A\|_F^2$ controls the size of the matrix A, and $\alpha$ is a regularization parameter ensuring a well-defined optimization problem. Here, $H_{st} = I_{n_{st}\times n_{st}} - \frac{1}{n_{st}}\mathbf{1}_{n_{st}\times n_{st}}$ is a centering matrix, where $n_{st} = n_s + n_t$ and $\mathbf{1}_{n_{st}\times n_{st}}$ is an $n_{st}\times n_{st}$ matrix with all elements equal to one.
$$\min_{A}\ \sum_{c=0}^{C}\mathrm{tr}\!\left(A^T X_{st} M_c X_{st}^T A\right) + \alpha\|A\|_F^2 \qquad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K\times K}. \tag{15}$$
Wang et al. [33] offered an insight into the working principle of MMD and theoretically revealed its high degree of agreement with human transfer behavior. In Figure 1 [33], for a pair of classes labeled "desktop computer", one from the source domain and one from the target domain, minimizing the MMD between the two distributions involves two key transformations: (1) the two relatively small red circles (the hollow and mesh circles) are enlarged into bigger red ones; i.e., their respective intra-class distances are maximized; and (2) the two red circles gradually move closer along their respective arrows; i.e., their joint variance is minimized. This process is analogous to how humans abstract common features to encompass all possible appearances, but much of the detailed information is lost. Wang et al. also demonstrated this insight theoretically.
Let $(S(A,X))^c_{inter} = \mathrm{tr}(A^T X_{st} M_c X_{st}^T A)$ denote the inter-class distance (i.e., the squared MMD) between the c-th class data of the source domain and of the target domain in the space defined by the transformation matrix A, and let $S_{inter} = \sum_{c=1}^{C}(S(A,X))^c_{inter}$; then (15) can be written as (16). Wang et al. derived $S_{inter} = S_{var} - S_{intra}$, so (16) can be written as (17), where $S_{intra}$ represents the intra-class distance and $S_{var}$ is the variance of the entire data. Therefore, minimizing the inter-class distance $S_{inter}$ is equivalent to minimizing the variance $S_{var}$ while simultaneously maximizing the intra-class distance $S_{intra}$, which reduces feature discriminativeness. To address this, a trade-off parameter is introduced to adjust the hidden intra-class distance in $S_{inter}$, as shown in Eq (18). They obtained an optimal linear transformation matrix A that minimizes the loss evaluated in this transformed space.
$$\min_{A}\ \left[S_{inter} + \mathrm{MMD}^2 + \alpha\|A\|_F^2\right] \qquad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K\times K}, \tag{16}$$

$$\min_{A}\ \left[S_{var} - S_{intra} + \mathrm{MMD}^2 + \alpha\|A\|_F^2\right] \qquad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K\times K}, \tag{17}$$

$$\min_{A}\ \left[S_{var} + \beta\, S_{intra} + \mathrm{MMD}^2 + \alpha\|A\|_F^2\right] \qquad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K\times K}. \tag{18}$$
The unsupervised domain adaptation training proposed in this paper uses a discriminative CWMMD (DCWMMD) to align data of the same class across the source and target domains. By alleviating the problem that minimizing the MMD reduces feature discriminativeness while it narrows the mean difference between the two domains, the proposed method effectively achieves the goal of unsupervised domain adaptation.
Unlike Wang et al. [33], who used the L2-norm as an MMD estimator in the linearly transformed feature space, this study employed a network to train a feature space. Samples in this space are then projected into an RKHS to efficiently evaluate and minimize the loss function. The Gaussian kernel is commonly used because the RKHS with the Gaussian kernel is guaranteed to be universal [30]. Wang et al. proposed that the inter-class distance is equal to the variation minus the intraclass distance under the MMD they defined. This section reformulates the interclass distance, intra-class distance, and variation as defined by Wang et al. and adopts the MMD with the Gaussian kernel. Moreover, it provides a proof that when using this MMD to measure the distance of distribution of samples from two domains, the interclass distance is indeed equal to the variation minus the intra-class distance.
This study considers only two domains, one source domain and one target domain. $X^c_s$ and $X^c_t$ denote the sample sets of class c from the source domain and the target domain, respectively, and $X^c$ (or $X^c_{st}$) denotes the union of the class-c sample sets of both domains, i.e., $X^c = X^c_{st} = X^c_s \cup X^c_t$. Further symbols and notations are listed in the nomenclature given in Table 1.
symbol | meaning
$X^c_s$ | set of samples of class c from the source domain
$X^c_t$ | set of samples of class c from the target domain
$X^c\,(=X^c_{st})$ | $X^c = X^c_s \cup X^c_t$, set of samples of class c from the source and target domains
$X_s$ | $X_s = \bigcup_{c=1}^{C} X^c_s$, set of samples from the source domain
$X_t$ | $X_t = \bigcup_{c=1}^{C} X^c_t$, set of samples from the target domain
$X\,(=X_{st})$ | $X = X_s \cup X_t$, set of samples from the source and target domains
$n^c_s$ | $|X^c_s|$, number of samples in $X^c_s$
$n^c_t$ | $|X^c_t|$, number of samples in $X^c_t$
$n^c\,(=n^c_{st})$ | $|X^c| = n^c_s + n^c_t$, number of samples in $X^c$
$n_s$ | $|X_s|$, number of samples in $X_s$
$n_t$ | $|X_t|$, number of samples in $X_t$
$n\,(=n_{st})$ | $|X| = \sum_{c=1}^{C} n^c = n_s + n_t$, number of samples in $X$
$m^c_s$ | $(1/n^c_s)\sum_{x_i\in X^c_s} x_i$, mean of $X^c_s$
$m^c_t$ | $(1/n^c_t)\sum_{x_i\in X^c_t} x_i$, mean of $X^c_t$
$m^c\,(=m^c_{st})$ | $(1/n^c)\sum_{x_i\in X^c} x_i$, mean of $X^c$
$m_s$ | $(1/n_s)\sum_{x_i\in X_s} x_i$, mean of $X_s$
$m_t$ | $(1/n_t)\sum_{x_i\in X_t} x_i$, mean of $X_t$
$m\,(=m_{st})$ | $(1/n)\sum_{x_i\in X} x_i$, mean of $X$
This subsection uses the RKHS-based MMD and leverages the kernel trick to efficiently compute the inter-class distance, intra-class distance, and variance of the samples from the two domains. Moreover, it demonstrates that, when the Gaussian kernel-based MMD is used, the inter-class distance can be decomposed into the variance minus the intra-class distance.
Definition 3.1. Interclass distance.
The square of the inter-class distance between the samples from the source domain and the target domain is defined as $S_{inter} = \sum_{c=1}^{C}(S)^c_{inter}$, where $(S)^c_{inter} = (S_{st})^c_{inter}$ is the square of the inter-class distance (or MMD) between the samples of class c from the source domain and the target domain, as shown in (19), which can be rewritten in the forms (20) and (21).

$$(S)^c_{inter} = k\!\left(m^c_s - m^c_t,\, m^c_s - m^c_t\right), \tag{19}$$

$$(S)^c_{inter} = \frac{n^c_s + n^c_t}{n^c_t}\, k\!\left(m^c_s - m^c_{st},\, m^c_s - m^c_{st}\right) + \frac{n^c_s + n^c_t}{n^c_s}\, k\!\left(m^c_t - m^c_{st},\, m^c_t - m^c_{st}\right), \tag{20}$$

$$(S)^c_{inter} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\left(n^c_s\, k\!\left(m^c_s - m^c_{st},\, m^c_s - m^c_{st}\right) + n^c_t\, k\!\left(m^c_t - m^c_{st},\, m^c_t - m^c_{st}\right)\right). \tag{21}$$
Definition 3.2. Intraclass distance.
The square of the intra-class distance between the samples from the source domain and the target domain is defined as $S_{intra} = \sum_{c=1}^{C}(S)^c_{intra}$, where $(S)^c_{intra} = (S_{st})^c_{intra}$ is the square of the intra-class distance of the samples of class c from the two domains, as defined in (22).

$$(S)^c_{intra} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\left(\sum_{x_i \in X^c_s} k\!\left(x_i - m^c_s,\, x_i - m^c_s\right) + \sum_{x_j \in X^c_t} k\!\left(x_j - m^c_t,\, x_j - m^c_t\right)\right). \tag{22}$$
Definition 3.3. Variance.
The joint variance of the samples from the source domain and the target domain is defined as $S_{var} = \sum_{c=1}^{C}(S)^c_{var}$, where $(S)^c_{var} = (S_{st})^c_{var}$ is the joint variance of the samples of class c from the source domain and the target domain, as shown in (23).

$$(S)^c_{var} = (S_{st})^c_{var} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\sum_{x_i \in X^c_{st}} k\!\left(x_i - m^c_{st},\, x_i - m^c_{st}\right). \tag{23}$$
The square of the MMD between the sample sets $X_s$ and $X_t$ from the source and target domains, respectively, is defined in (24), or equivalently in (25). Conceptually, this amounts to treating the samples from both domains as belonging to a single class and computing the inter-class distance. It can be expressed as $(S)^0_{inter} = (S_{st})^0_{inter} = (\mathrm{MMD}(X_s, X_t))^2$, or as $(\mathrm{MMD}_u(X_s, X_t))^2$ when unbiased estimation is employed, as given in formulas (26) and (27), respectively.

$$(\mathrm{MMD}(X_s, X_t))^2 = k(m_s - m_t,\, m_s - m_t) = \frac{n_s + n_t}{n_t}\, k(m_s - m_{st},\, m_s - m_{st}) + \frac{n_s + n_t}{n_s}\, k(m_t - m_{st},\, m_t - m_{st}), \tag{24}$$

$$(\mathrm{MMD}(X_s, X_t))^2 = \frac{n_s + n_t}{n_s n_t}\left(n_s\, k(m_s - m_{st},\, m_s - m_{st}) + n_t\, k(m_t - m_{st},\, m_t - m_{st})\right), \tag{25}$$

$$(\mathrm{MMD}(X_s, X_t))^2 = \frac{1}{n_s^2}\sum_{x_i \in X_s}\sum_{x_j \in X_s} k(x_i, x_j) + \frac{1}{n_t^2}\sum_{x_i \in X_t}\sum_{x_j \in X_t} k(x_i, x_j) - \frac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t} k(x_i, x_j), \tag{26}$$

$$(\mathrm{MMD}_u(X_s, X_t))^2 = \frac{1}{n_s(n_s - 1)}\sum_{\substack{x_i, x_j \in X_s \\ x_i \ne x_j}} k(x_i, x_j) + \frac{1}{n_t(n_t - 1)}\sum_{\substack{x_i, x_j \in X_t \\ x_i \ne x_j}} k(x_i, x_j) - \frac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t} k(x_i, x_j). \tag{27}$$
Theorem 3.1. The square of the inter-class distance equals the data variance minus the square of the intra-class distance; that is, $S_{inter} = S_{var} - S_{intra}$.
Proof. Since
$$(S)^c_{inter} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\left(n^c_s\, k(m^c_s - m^c_{st},\, m^c_s - m^c_{st}) + n^c_t\, k(m^c_t - m^c_{st},\, m^c_t - m^c_{st})\right),$$
$$(S)^c_{var} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\sum_{x_i \in X^c_{st}} k(x_i - m^c_{st},\, x_i - m^c_{st}), \quad (S)^c_{intra} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\left(\sum_{x_i \in X^c_s} k(x_i - m^c_s,\, x_i - m^c_s) + \sum_{x_j \in X^c_t} k(x_j - m^c_t,\, x_j - m^c_t)\right),$$
it is sufficient to prove that $(S)^c_{inter} + (S)^c_{intra} = (S)^c_{var}$, or equivalently that
$$n^c_{sd}\, k(m^c_{sd} - m^c,\, m^c_{sd} - m^c) + \sum_{x_i \in X^c_{sd}} k(x_i - m^c_{sd},\, x_i - m^c_{sd}) = \sum_{x_i \in X^c_{sd}} k(x_i - m^c,\, x_i - m^c)$$
for $1 \le c \le C$ and $sd \in \{s, t\}$. Since
$$n^c_{sd}\, k(m^c_{sd} - m^c,\, m^c_{sd} - m^c) = \sum_{x_i \in X^c_{sd}} k(m^c_{sd} - m^c,\, m^c_{sd} - m^c) = \sum_{x_i \in X^c_{sd}}\left(k(m^c_{sd}, m^c_{sd}) + k(m^c, m^c) - 2k(m^c_{sd}, m^c)\right) = \sum_{x_i \in X^c_{sd}}\left(k(m^c_{sd}, m^c_{sd}) + k(m^c, m^c) - 2k(x_i, m^c)\right)$$
and
$$\sum_{x_i \in X^c_{sd}} k(x_i - m^c_{sd},\, x_i - m^c_{sd}) = \sum_{x_i \in X^c_{sd}}\left(k(x_i, x_i) + k(m^c_{sd}, m^c_{sd}) - 2k(x_i, m^c_{sd})\right) = \sum_{x_i \in X^c_{sd}}\left(k(x_i, x_i) + k(m^c_{sd}, m^c_{sd}) - 2k(m^c_{sd}, m^c_{sd})\right) = \sum_{x_i \in X^c_{sd}}\left(k(x_i, x_i) - k(m^c_{sd}, m^c_{sd})\right),$$
we have
$$n^c_{sd}\, k(m^c_{sd} - m^c,\, m^c_{sd} - m^c) + \sum_{x_i \in X^c_{sd}} k(x_i - m^c_{sd},\, x_i - m^c_{sd}) = \sum_{x_i \in X^c_{sd}}\left(k(m^c_{sd}, m^c_{sd}) + k(m^c, m^c) - 2k(x_i, m^c) + k(x_i, x_i) - k(m^c_{sd}, m^c_{sd})\right) = \sum_{x_i \in X^c_{sd}} k(x_i - m^c,\, x_i - m^c).$$
This completes the proof.
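The identity of Theorem 3.1 can also be checked numerically per class; the short NumPy script below does so with random data, taking k to be the plain inner product (i.e., working directly in the feature space), which is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Xs_c = rng.normal(size=(7, 5))        # class-c source samples (n_s^c = 7, dimension 5)
Xt_c = rng.normal(size=(4, 5))        # class-c target samples (n_t^c = 4)

ns, nt = len(Xs_c), len(Xt_c)
ms, mt = Xs_c.mean(axis=0), Xt_c.mean(axis=0)
Xc = np.vstack([Xs_c, Xt_c])
m_st = Xc.mean(axis=0)

k = lambda u, v: float(u @ v)         # k taken as the plain inner product here

scale = (ns + nt) / (ns * nt)
S_inter = scale * (ns * k(ms - m_st, ms - m_st) + nt * k(mt - m_st, mt - m_st))  # (21)
S_intra = scale * (sum(k(x - ms, x - ms) for x in Xs_c)
                   + sum(k(x - mt, x - mt) for x in Xt_c))                       # (22)
S_var = scale * sum(k(x - m_st, x - m_st) for x in Xc)                           # (23)

print(S_inter + S_intra, S_var)       # both values agree: S_inter = S_var - S_intra
```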
Theorem 3.1 indicates that minimizing the interclass distance is equivalent to minimizing their variation, while simultaneously maximizing the intra-class distance, thus reducing feature discriminativeness. To address this, the strategy proposed by Wang et al. [33] was adopted with a trade-off parameter β (−1≤β≤1) introduced to adjust the hidden intra-class distance within Sinter, resulting in the formulation of the discriminative class-level loss function, denoted as Ldcwmmd in formula (28), and its expansion is given in formula (29).
$$L_{dcwmmd} = S_{var} + \beta\, S_{intra} + \mathrm{MMD}^2(X_s, X_t) = \sum_{c=1}^{C}(S_{st})^c_{var} + \beta\sum_{c=1}^{C}(S_{st})^c_{intra} + (S_{st})^0_{inter}, \tag{28}$$

$$\begin{aligned}
L_{dcwmmd} = {} & \sum_{c=1}^{C}\frac{n^c_s + n^c_t}{n^c_s n^c_t}\left(\sum_{x_j \in X^c_{st}}\langle x_j - m^c_{st},\, x_j - m^c_{st}\rangle_H + \beta\sum_{x_i \in X^c_s}\langle x_i - m^c_s,\, x_i - m^c_s\rangle_H + \beta\sum_{x_j \in X^c_t}\langle x_j - m^c_t,\, x_j - m^c_t\rangle_H\right) \\
& + \frac{n_s + n_t}{n_t}\langle m_s - m_{st},\, m_s - m_{st}\rangle_H + \frac{n_s + n_t}{n_s}\langle m_t - m_{st},\, m_t - m_{st}\rangle_H. \tag{29}
\end{aligned}$$
The terms $(S_{st})^c_{inter}$, $(S_{st})^c_{intra}$, and $(S_{st})^c_{var}$ defined in Definitions 3.1 to 3.3 can be expressed in terms of the individual samples $x_j$ using formulas (30) to (32). To ensure unbiased estimation and to compute the deviations in the feature space, this study uses the loss function $L^u_{dcwmmd}$ given in (33), which is evaluated on the feature representations $z_j$ of the samples $x_j$. By setting $\alpha_1 = \frac{(\beta+1)(n^c_s + n^c_t)}{n^c_s n^c_t}$, $\alpha_2 = -\frac{n^c_s + (n^c_s + n^c_t)\beta}{(n^c_s)^2 n^c_t}$, $\alpha_3 = -\frac{n^c_t + (n^c_s + n^c_t)\beta}{n^c_s (n^c_t)^2}$, $\alpha_4 = -\frac{2}{n^c_s n^c_t}$, $\gamma_1 = \frac{1}{n_s(n_s-1)}$, $\gamma_2 = \frac{1}{n_t(n_t-1)}$, and $\gamma_3 = -\frac{2}{n_s n_t}$, the simplified form of $L^u_{dcwmmd}$ is given in formula (34). During training, these scalar values can be precomputed and stored, eliminating the need for subsequent recalculation.
$$(S_{st})^c_{inter} = \frac{1}{(n^c_s)^2}\sum_{x_i, x_j \in X^c_s}\langle x_i, x_j\rangle_H + \frac{1}{(n^c_t)^2}\sum_{x_i, x_j \in X^c_t}\langle x_i, x_j\rangle_H - \frac{2}{n^c_s n^c_t}\sum_{x_i \in X^c_s,\, x_j \in X^c_t}\langle x_i, x_j\rangle_H, \tag{30}$$

$$(S_{st})^c_{intra} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\left[\sum_{x_i \in X^c_{st}}\langle x_i, x_i\rangle_H - \frac{1}{n^c_s}\sum_{x_i, x_j \in X^c_s}\langle x_i, x_j\rangle_H - \frac{1}{n^c_t}\sum_{x_i, x_j \in X^c_t}\langle x_i, x_j\rangle_H\right], \tag{31}$$

$$(S_{st})^c_{var} = \frac{n^c_s + n^c_t}{n^c_s n^c_t}\sum_{x_j \in X^c_{st}}\langle x_j, x_j\rangle_H - \frac{1}{n^c_s n^c_t}\sum_{x_i, x_j \in X^c_s}\langle x_i, x_j\rangle_H - \frac{1}{n^c_s n^c_t}\sum_{x_i, x_j \in X^c_t}\langle x_i, x_j\rangle_H - \frac{2}{n^c_s n^c_t}\sum_{x_i \in X^c_s,\, x_j \in X^c_t}\langle x_i, x_j\rangle_H, \tag{32}$$

$$\begin{aligned}
L^u_{dcwmmd}(X_s, X_t) = {} & \sum_{c=1}^{C}\Bigg[\frac{(\beta+1)(n^c_s + n^c_t)}{n^c_s n^c_t}\sum_{x_j \in X^c_{st}}\langle x_j, x_j\rangle_H - \frac{n^c_s + (n^c_s + n^c_t)\beta}{(n^c_s)^2 n^c_t}\sum_{x_i \in X^c_s}\sum_{x_j \in X^c_s}\langle x_i, x_j\rangle_H \\
& \qquad - \frac{n^c_t + (n^c_s + n^c_t)\beta}{n^c_s (n^c_t)^2}\sum_{x_i \in X^c_t}\sum_{x_j \in X^c_t}\langle x_i, x_j\rangle_H - \frac{2}{n^c_s n^c_t}\sum_{x_i \in X^c_s}\sum_{x_j \in X^c_t}\langle x_i, x_j\rangle_H\Bigg] \\
& + \frac{1}{n_s(n_s-1)}\sum_{\substack{x_i, x_j \in X_s \\ x_i \ne x_j}}\langle x_i, x_j\rangle_H + \frac{1}{n_t(n_t-1)}\sum_{\substack{x_i, x_j \in X_t \\ x_i \ne x_j}}\langle x_i, x_j\rangle_H - \frac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t}\langle x_i, x_j\rangle_H, \tag{33}
\end{aligned}$$

$$\begin{aligned}
L^u_{dcwmmd}(Z_s, Z_t) = {} & \sum_{c=1}^{C}\Bigg[\alpha_1\sum_{z_j \in Z^c_{st}}\langle z_j, z_j\rangle_H + \alpha_2\sum_{z_i, z_j \in Z^c_s}\langle z_i, z_j\rangle_H + \alpha_3\sum_{z_i, z_j \in Z^c_t}\langle z_i, z_j\rangle_H + \alpha_4\sum_{z_i \in Z^c_s,\, z_j \in Z^c_t}\langle z_i, z_j\rangle_H\Bigg] \\
& + \gamma_1\sum_{\substack{z_i, z_j \in Z_s \\ z_i \ne z_j}}\langle z_i, z_j\rangle_H + \gamma_2\sum_{\substack{z_i, z_j \in Z_t \\ z_i \ne z_j}}\langle z_i, z_j\rangle_H + \gamma_3\sum_{z_i \in Z_s,\, z_j \in Z_t}\langle z_i, z_j\rangle_H. \tag{34}
\end{aligned}$$
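A possible PyTorch implementation of the simplified loss (34) is sketched below; the Gaussian-kernel Gram matrices supply the inner products in H, while the fixed bandwidth, the skipping of classes absent from a mini-batch, and the function names are our own assumptions for illustration.

```python
import torch

def gaussian_gram(Z1, Z2, sigma):
    """Pairwise kernel inner products <phi(z_i), phi(z_j)>_H with the Gaussian kernel (8)."""
    return torch.exp(-torch.cdist(Z1, Z2).pow(2) / (2.0 * sigma ** 2))

def dcwmmd_loss(Zs, Zt, ys, yt_pseudo, num_classes, beta=0.0, sigma=1.0):
    """Discriminative class-wise loss L^u_dcwmmd of formula (34) (sketch).

    Zs, Zt           : (n_s, d), (n_t, d) feature tensors
    ys, yt_pseudo    : label / pseudo-label tensors for the two batches
    """
    loss = Zs.new_zeros(())
    for c in range(num_classes):
        Zcs, Zct = Zs[ys == c], Zt[yt_pseudo == c]
        ncs, nct = len(Zcs), len(Zct)
        if ncs == 0 or nct == 0:            # skip classes missing from this mini-batch
            continue
        a1 = (beta + 1.0) * (ncs + nct) / (ncs * nct)
        a2 = -(ncs + (ncs + nct) * beta) / (ncs ** 2 * nct)
        a3 = -(nct + (ncs + nct) * beta) / (ncs * nct ** 2)
        a4 = -2.0 / (ncs * nct)
        Zc = torch.cat([Zcs, Zct])
        loss = loss + a1 * gaussian_gram(Zc, Zc, sigma).diag().sum() \
                    + a2 * gaussian_gram(Zcs, Zcs, sigma).sum() \
                    + a3 * gaussian_gram(Zct, Zct, sigma).sum() \
                    + a4 * gaussian_gram(Zcs, Zct, sigma).sum()
    # global unbiased MMD terms weighted by gamma_1, gamma_2, gamma_3
    ns, nt = len(Zs), len(Zt)
    Kss = gaussian_gram(Zs, Zs, sigma)
    Ktt = gaussian_gram(Zt, Zt, sigma)
    Kst = gaussian_gram(Zs, Zt, sigma)
    loss = loss + (Kss.sum() - Kss.diag().sum()) / (ns * (ns - 1)) \
                + (Ktt.sum() - Ktt.diag().sum()) / (nt * (nt - 1)) \
                - 2.0 * Kst.mean()
    return loss
```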
A categorical cross-entropy, $L_{cls}$, is commonly used as the error for the classifier's predictions on the source domain data, as shown in formula (35). Here, $\hat{l}^c_{si}$ is the c-th element of $\hat{l}_{si} = C(z_{si})$ and $y^c_{si}$ is the c-th element of the ground-truth one-hot label vector $y_{si}$, where $y^c_{si} = 1$ if the label of the original sample $x_{si}$ corresponding to $z_{si}$ is c, and $y^c_{si} = 0$ otherwise. To encourage the samples to form dense, uniform, and well-separated clusters, the label-smoothing (LS) technique [42] is applied to the cross-entropy loss. This substitutes the smoothed label $(1-\alpha)y^c_{si} + \alpha/C$, a weighted average of $y^c_{si}$ and $1/C$, for $y^c_{si}$ in the categorical cross-entropy, yielding the smoothed categorical cross-entropy $L^{ls}_{cls}(Z_s, Y_s)$ shown in (36). Here, $\alpha$ is a smoothing factor, generally set to 0.1 for better performance, and C is the number of classes. The goal of LS is to prevent the model from becoming overconfident in its predictions and to reduce overfitting. Müller et al. [42] showed that LS encourages the representations of training examples from the same class to group into tight clusters.
During training with target samples, the entropy of predicted results for those samples is minimized, as illustrated in formula (37). This strategy is employed because it has been indicated that unlabeled examples are especially beneficial when class overlap is small [43]. Minimizing this entropy encourages the predicted results to be more inclined toward a specific category, making the feature distribution between categories in the target domain more distinct and explicit. The training loss function of the entire network is defined as LDCWDA, as shown in formula (38), where ω1 and ω2 are weight parameters.
$$L_{cls}(Z_s, Y_s) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} y^c_{si}\, \log \hat{l}^c_{si}, \tag{35}$$

$$L^{ls}_{cls}(Z_s, Y_s) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C}\left((1-\alpha)\,y^c_{si} + \alpha/C\right) \log \hat{l}^c_{si}, \tag{36}$$

$$L_{ent}(Z_t) = -\frac{1}{n_t}\sum_{j=1}^{n_t}\sum_{c=1}^{C} \hat{l}^c_{tj}\, \log \hat{l}^c_{tj}, \tag{37}$$

$$L_{DCWDA} = L^u_{dcwmmd} + \omega_1 L^{ls}_{cls} + \omega_2 L_{ent}. \tag{38}$$
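The classification-related terms (36) and (37) translate directly into PyTorch; the sketch below is one possible realization, with illustrative names and a small clamping constant added for numerical stability.

```python
import torch
import torch.nn.functional as F_nn

def smoothed_cross_entropy(logits_s, labels_s, alpha=0.1):
    """Label-smoothed cross-entropy L^ls_cls of formula (36) for source-domain logits."""
    num_classes = logits_s.size(1)
    log_probs = F_nn.log_softmax(logits_s, dim=1)
    smooth = F_nn.one_hot(labels_s, num_classes).float() * (1.0 - alpha) + alpha / num_classes
    return -(smooth * log_probs).sum(dim=1).mean()

def prediction_entropy(logits_t):
    """Mean entropy L_ent of the target-domain predictions, formula (37)."""
    probs = F_nn.softmax(logits_t, dim=1)
    return -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1).mean()

# total objective, formula (38):
# loss = dcwmmd_loss(...) + w1 * smoothed_cross_entropy(logits_s, ys) + w2 * prediction_entropy(logits_t)
```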
The training architecture of the proposed discriminative class-wise domain adaptation (DCWDA) system, shown in Figure 2, consists of a feature extractor (F) used for extracting domain-invariant features and a classifier (C). The feature extractor F and the classifier C are duplicated in the figure to represent the data paths of the source and target domains, with a dotted line in the middle indicating shared weights. During training, the source domain samples $x_s$ and target domain samples $x_t$ are first separately input into the feature extractor F, which outputs the features $z_s = F(x_s)$ and $z_t = F(x_t)$. The discriminative class-wise loss $L^u_{dcwmmd}$ is then computed from $z_s$ and $z_t$. Subsequently, $z_s$ and $z_t$ are separately input into the classifier C, producing the classification results $\hat{l}_s = C(z_s)$ and $\hat{l}_t = C(z_t)$. This allows the calculation of the cross-entropy loss for the predictions on the source domain samples and the entropy of the predictions on the target domain samples. The training procedure is shown in Algorithm 1, where the batch sizes of both the source samples and the target samples are set to N.
Algorithm 1. Training the DCWDA model.

Input: $\Delta_s, \Delta_t, \alpha, \omega_1, \omega_2, \eta_2$; initialize the parameters $\theta_F$ and $\theta_C$
# train the model parameters $\theta_F$ and $\theta_C$ on $\Delta_s$ and $\Delta_t$
repeat until convergence
  $(X_s, Y_s) = \{(x_{s1}, y_{s1}), (x_{s2}, y_{s2}), \ldots, (x_{sN}, y_{sN})\}$ ← mini-batch from $\Delta_s$
  $X_t = \{x_{t1}, x_{t2}, \ldots, x_{tN}\}$ ← mini-batch from $\Delta_t$
  $Z_s$ ← $F(X_s)$;  $Z_t$ ← $F(X_t)$
  # generate pseudo labels:
  $\hat{L}_t = \{\hat{l}_{t1}, \hat{l}_{t2}, \ldots, \hat{l}_{tN}\}$ ← $C(F(X_t))$   # classify the target samples
  $Y_t = \{y_{t1}, y_{t2}, \ldots, y_{tN}\} = \{M(\hat{l}_{t1}), M(\hat{l}_{t2}), \ldots, M(\hat{l}_{tN})\}$   # obtain pseudo labels, where $M((v_1, v_2, \ldots, v_C)) = \arg\max_{1\le c\le C} v_c$
  # evaluate the losses:
  $L^u_{dcwmmd}(X_s, X_t)$ ← …   # using (29)
  $L^{ls}_{cls}$ ← $-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\left((1-\alpha)y^c_{si} + \alpha/C\right)\log\hat{l}^c_{si}$   # using (36)
  $L_{ent}$ ← $-\frac{1}{N}\sum_{j=1}^{N}\sum_{c=1}^{C}\hat{l}^c_{tj}\log\hat{l}^c_{tj}$   # using (37)
  $L_{DCWDA}$ ← $L^u_{dcwmmd} + \omega_1 L^{ls}_{cls} + \omega_2 L_{ent}$
  # update $\theta_F$ and $\theta_C$ to minimize $L_{DCWDA}$:
  $\theta_F$ ← $\theta_F - \eta_2\nabla_{\theta_F}L_{DCWDA}$
  $\theta_C$ ← $\theta_C - \eta_2\nabla_{\theta_C}L_{DCWDA}$
end repeat
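Putting the pieces together, one training iteration of Algorithm 1 could look roughly like the following PyTorch sketch, reusing the loss helpers sketched earlier. For brevity it uses the simple arg-max pseudo labels of formula (1) within the batch, whereas the centroid-based strategy of formulas (2)–(5) would replace that step; all names are illustrative.

```python
import torch

def train_step(F_net, C_net, optimizer, xs, ys, xt, w1=1.0, w2=0.1, beta=0.0):
    """One update of Algorithm 1 (sketch); F_net and C_net are shared across both domains."""
    zs, zt = F_net(xs), F_net(xt)                  # features for source and target batches
    logits_s, logits_t = C_net(zs), C_net(zt)

    with torch.no_grad():                          # pseudo labels for the target mini-batch
        yt_pseudo = logits_t.argmax(dim=1)

    loss = dcwmmd_loss(zs, zt, ys, yt_pseudo, logits_s.size(1), beta=beta) \
         + w1 * smoothed_cross_entropy(logits_s, ys) \
         + w2 * prediction_entropy(logits_t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # updates theta_F and theta_C jointly
    return loss.item()
```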
The proposed method was evaluated using digit datasets and office object data. The digit datasets used in the experiments are the Modified National Institute of Standards and Technology database (MNIST) [44], the U.S. Postal Service dataset (USPS) [45], and the Street View House Numbers dataset (SVHN) [46]. MNIST and USPS are handwritten digit datasets. MNIST has 60,000 training samples and 10,000 testing samples, all grayscale images of size 28×28. USPS consists of 9,298 grayscale images of size 16×16. SVHN contains 73,257 training images and 26,032 test images, which are 32×32 color images cropped from street-view house number photos. In each image, the digit to be recognized is a single digit of a house number located at the center of the image, surrounded by other digits or distracting objects. In the experiments, all images are scaled to 32×32 pixels. Figure 3 displays some images from MNIST, USPS, and SVHN, where the numbers within the blue frames in the SVHN images are the digits to be recognized. The Office-31 [47] dataset comprises three domains: Amazon (A), DSLR (Digital Single-Lens Reflex) (D), and Webcam (W). Each domain contains the same 31 object categories from an office environment, totaling 4,110 images, with varying numbers of images per category. Figure 4 displays some images from Webcam, DSLR, and Amazon.
In the training process, the batch sizes for the digit datasets and the Office-31 dataset are set to 128 and 64, respectively. ResNet-18 and ResNet-50 [1] are adopted as the feature extractor backbones for the digit datasets and the Office-31 dataset, respectively. Both architectures are initialized with ImageNet pre-trained parameters and then fine-tuned. In addition, the pseudo labels of all target domain training data are updated with the current classifier parameters at the beginning of each epoch.
The accuracies of various combinations of source domain and target domain were evaluated. The combinations for the digit datasets are: MNIST to USPS (M → U), USPS to MNIST (U → M), and SVHN to MNIST (S → M). The combinations for the Office-31 dataset are: Amazon to DSLR (A → D), Amazon to Webcam (A → W), DSLR to Amazon (D → A), DSLR to Webcam (D → W), Webcam to Amazon (W → A), and Webcam to DSLR (W → D). Table 2 compares the proposed method with various unsupervised domain adaptation methods on the digit datasets, including adversarial discriminative domain adaptation (ADDA) [17], adversarial dropout regularization (ADR) [48], conditional adversarial domain adaptation (CDAN) [49], cycle-consistent adversarial domain adaptation (CyCADA) [50], sliced Wasserstein discrepancy (SWD) [51], and source hypothesis transfer (SHOT) [36]. Table 3 compares the proposed method with various unsupervised domain adaptation methods on the Office-31 dataset, including: Wang et al. [33], deep adaptation networks (DAN) [18], domain-adversarial neural network (DANN) [16], ADDA [17], multi-adversarial domain adaptation (MADA) [52], SHOT [36], collaborative and adversarial network (CAN) [14], and mini-batch dynamic geometric embedding (MDGE) [23]. Each accuracy is the average of three test runs. The best-performing method for each source-to-target combination is highlighted in bold. The "Source-only" row indicates that the classifier is trained directly on the source domain data without domain adaptation and then tested on the target domain data. The "Target-supervised" row shows the result of training the classifier directly on the target domain data and testing it on the target data. Typically, the "source-only" and "target-supervised" accuracies serve as the lower and upper bounds for domain adaptation accuracy, although there is no guarantee that the accuracy will fall within this range.
As can be seen from Table 2, the proposed method outperforms or matches the other methods on the digit dataset pairs except S → M, and achieves the highest average accuracy. It is worth noting that SVHN images exhibit strong color variation and noise, while USPS is a smaller dataset with smaller images than the other digit datasets; hence, test results for combinations involving USPS and SVHN would not be very informative. For these reasons, the combinations M → S, S → U, and U → S are not used in the digit dataset experiments. As can be seen from Table 3, the proposed method outperforms the other methods on three of the six Office-31 combinations and achieves the highest average accuracy.
method | M → U | U → M | S → M | Average
Source-only | 69.6 | 82.2 | 67.1 | 73.0 |
ADDA [17] | 90.1 | 89.4 | 76.0 | 85.2 |
ADR [48] | 93.1 | 93.2 | 95.0 | 93.8 |
CDAN [49] | 98.0 | 95.6 | 89.2 | 94.3 |
CyCADA [50] | 96.5 | 95.6 | 90.4 | 94.2 |
SWD [51] | 97.1 | 98.1 | 98.9 | 98.0 |
SHOT [36] | 97.8 | 97.6 | 99.0 | 98.1 |
ours | 98.0 | 98.2 | 98.8 | 98.3 |
target-supervised | 99.4 | 98.1 | 99.4 | 98.9 |
method | A → D | A → W | D → A | D → W | W → A | W → D | Average
Source-only | 68.9 | 68.4 | 62.5 | 96.7 | 60.7 | 99.3 | 76.1 |
Wang et al. [33] | 90.76 | 88.93 | 75.43 | 98.49 | 75.15 | 99.80 | 88.06 |
DAN [18] | 78.6 | 80.5 | 63.6 | 97.1 | 62.8 | 99.6 | 80.4 |
DANN[16] | 79.7 | 82.0 | 68.2 | 96.9 | 67.4 | 99.1 | 82.2 |
ADDA [17] | 77.8 | 86.2 | 69.5 | 96.2 | 68.9 | 98.4 | 82.9 |
MADA [52] | 87.8 | 90.0 | 70.3 | 97.4 | 66.4 | 99.6 | 85.2 |
SHOT [36] | 93.9 | 90.1 | 75.3 | 98.7 | 75.0 | 99.9 | 88.8 |
CAN [14] | 95.0 | 94.5 | 78.0 | 99.1 | 77.0 | 99.8 | 90.6 |
MDGE [23] | 90.6 | 89.4 | 69.5 | 98.9 | 68.4 | 99.8 | 86.1 |
ours | 96.3 | 94.9 | 77.9 | 99.5 | 76.5 | 99.6 | 90.8 |
target-supervised | 98.0 | 98.7 | 86.0 | 98.7 | 86.0 | 98.0 | 94.3 |
In this paper, we tackled the domain adaptation problem by using a deep network architecture with a DCWMMD as the loss function. The MMD used is based on embedding distribution metrics in a reproducing kernel Hilbert space, which not only leverages the kernel trick for computational efficiency but also conforms to the original MMD definition. The marginal MMD aligns the overall data distributions but does not ensure class-level alignment. To alleviate this limitation, the CWMMD was introduced to align the data distributions of the same class from the two domains. However, this adjustment may reduce feature discriminativeness. By decomposing the CWMMD into the variance minus the intra-class distance, an adjustable weight parameter for the intra-class distance term was introduced, providing the flexibility to preserve feature discriminability. The experimental results show that our proposed method improves upon the approach of Wang et al. [33]. In terms of the error function, we not only applied the LS technique to the cross-entropy used for training on the source domain, but also added the entropy of the predicted labels of the target samples to enhance the overall training performance. The proposed architecture was evaluated on two benchmarks, the digit datasets and the Office-31 dataset. The results demonstrate competitive domain adaptation accuracy compared with other methods.
In the future, we will continue to improve the training process of the system, for example by applying data augmentation to increase the diversity of the data and by using high-confidence target domain data to provide pseudo labels for supervised post-processing training. Last but not least, we also plan to apply our work to other domain adaptation tasks, such as face recognition, object recognition, and image-to-image translation.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the National Science and Technology Council, Taiwan, R.O.C. under the grant NSTC 112-2221-E-032-041.
The authors declare no conflict of interest.