In distributional reinforcement learning (RL), not only expected returns but the complete return distributions of a policy are taken into account. The return distribution for a fixed policy is given as the solution of an associated distributional Bellman equation. In this note, we consider general distributional Bellman equations and study the existence and uniqueness of their solutions, as well as tail properties of return distributions. We give necessary and sufficient conditions for the existence and uniqueness of return distributions and identify cases of regular variation.
We link distributional Bellman equations to multivariate affine distributional equations. We show that any solution of a distributional Bellman equation can be obtained as the vector of marginal laws of a solution to a multivariate affine distributional equation. This makes the general theory of such equations applicable to the distributional reinforcement learning setting.
Citation: Julian Gerstenberg, Ralph Neininger, Denis Spiegel. On solutions of the distributional Bellman equation[J]. Electronic Research Archive, 2023, 31(8): 4459-4483. doi: 10.3934/era.2023228
The objective in RL is to teach an agent that sequentially interacts with an environment to choose "good" actions. For each action, the agent receives an immediate real-valued reward. The rewards are accumulated over time, resulting in the so-called return, which describes the overall performance of the agent. Randomness may be involved at all levels of this interaction: in choosing actions, the environment reacting to actions, and/or in the rewards received by the agent. Hence, the return is to be considered random as well. In more classical approaches to RL problems, the randomness is averaged out, and only the expected return is considered when evaluating the performance of an agent. In [1], not only the expectation but the complete distribution of the return was considered, introducing what is now known as distributional RL, see [2].
Mathematically, an RL problem is typically modeled as a Markov decision process (MDP), that is, as a particular type of discrete-time stochastic control problem. An overview of the classical MDP theory and its applications is presented in [3]. For more details on distributional RL using notation similar to ours, we refer to [2,4].
For any measurable space X, we write P(X) for the set of probability distributions on X. The distribution (law) of an X-valued random variable (RV) X is denoted by L(X)=P[X∈⋅]∈P(X). We also write X∼μ for short if L(X)=μ∈P(X). In case X is countable, discrete distributions ν∈P(X) can be identified with functions ν:X→[0,1] satisfying ∑x∈Xν(x)=1. For a random variable X and an event A with P[A]>0, we write L(X|A)∈P(X) for the conditional distribution of X given A.
Let S and A be non-empty finite sets. An MDP on states S and actions A is a function
$$ \rho: S \times A \to \mathcal{P}(S \times \mathbb{R}), \qquad (s,a) \mapsto \rho(s,a). $$
The interpretation is as follows: S is the set of states an environment can occupy and A the set of possible actions the agent can perform. If in state s∈S the agent performs action a∈A, the environment reacts with a (possibly random) successor state S together with a (possibly random) real-valued reward R having joint distribution (S,R)∼ρ(s,a)∈P(S×R). How an agent chooses its actions can be modeled by a (stationary) policy
$$ \pi: S \to \mathcal{P}(A), \qquad s \mapsto \pi_s. $$
An agent that is in state s and acts according to policy π chooses a (possibly random) action A distributed as A∼πs∈P(A).
Suppose an agent acts according to π. Starting at time t=0 from a possibly random state S(0), the following dynamic is defined inductively: at time t∈N0, the agent finds itself in state S(t), chooses action A(t)∼πS(t) (independent of the past given S(t)) and the environment reacts with the next state S(t+1) and immediate reward R(t) jointly distributed as (S(t+1),R(t))∼ρ(S(t),A(t)) (independent of the past given (S(t),A(t))). The resulting stochastic process
$$ (S(t), A(t), R(t))_{t \in \mathbb{N}_0} \tag{1.1} $$
is called a Markov reward process in [2] or a full MDP in [4]. To emphasize that the distribution of the stochastic process (1.1) depends on π, we write Pπ instead of P.
A suitable way to judge the overall performance of the agent is to choose a discount factor γ∈(0,1) and to consider the discounted accumulated rewards, the so-called return:
$$ G = R(0) + \gamma R(1) + \gamma^2 R(2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(t). $$
The return G is defined as an infinite series of random variables and its existence as an R-valued random variable is not automatically guaranteed, for instance if rewards are (extremely) heavy-tailed. Theorem 1 below provides a complete characterization of when the return G exists almost surely as an R-valued random variable starting from any state S(0)=s. The relation of Theorem 1 to earlier results is discussed in Remarks 2, 3 and 4. In many applications, heavy-tailed reward distributions are not encountered. For instance, if rewards are obtained as a deterministic function of states and actions, the rewards are bounded, and the existence of the return G easily follows from a geometric series argument. However, unbounded and even heavy-tailed reward distributions are of interest in various applications: fields in which RL approaches have been considered and heavy-tailed distributions play a crucial role include insurance and finance. See [5] and [6] for RL approaches to pricing and trading and [7] for the role of heavy-tailed distributions in that field. Recently, another research area in which heavy-tailed reward distributions have been considered is the study of multi-armed bandit models, which corresponds to an MDP with only a single state, |S|=1. In [8], the authors report on several studies in this direction, including linear bandits [9,10], pure exploration [11], Lipschitz bandits [12], Bayesian optimization [13], and Thompson sampling [14]. Notably, in the latter work, a particular emphasis is placed on α-stable reward distributions, an assumption also covered by our analysis, see Theorem 4 and Example 3. It is worth noting that all of these works consider finite horizon non-discounted returns rather than infinite discounted returns. Our analysis offers a theoretical justification for the inclusion of heavy-tailed reward distributions in infinite discounted reward scenarios.
Remark 1. The case γ=0 is sometimes considered in RL literature, resulting in G=R(0) and trivializing many of the research questions we investigate here. Although we do not discuss this case further, it should be noted that several parts of our results remain applicable even when γ=0.
Suppose the return G exists as an R-valued RV starting from any state s. Of interest in distributional policy evaluation are the distributions of the return starting from given states. The state(-action) return distributions are defined by
$$ \eta^\pi_s = \mathbb{P}^\pi[G \in \cdot \mid S(0)=s], \quad s \in S, \qquad \eta^\pi_{(s,a)} = \mathbb{P}^\pi[G \in \cdot \mid S(0)=s, A(0)=a], \quad (s,a) \in S \times A. $$
In [2], the collection ηπ=(ηπs)s∈S is called a return distribution function. The state(-action) return distributions can be found as solutions to the distributional Bellman equations, which we now explain. For r∈R,γ∈(0,1) let fr,γ:P(R)→P(R) be the map that sends ν=L(X) to fr,γ(ν)=L(r+γX). Using the recursive structure of the return, G=R(0)+γG′ with G′=∑∞t=0γtR(t+1), and Markov properties of (1.1), it is seen that state and state-action return distributions are related by
$$ \eta^\pi_s = \sum_{a \in A} \pi_s(a)\, \eta^\pi_{(s,a)}, \quad s \in S, \qquad \eta^\pi_{(s,a)} = \int_{S \times \mathbb{R}} f_{r,\gamma}(\eta^\pi_{s'})\, d\rho(s,a)(s',r), \quad (s,a) \in S \times A. $$
Substituting the formulas into one another yields the distributional Bellman equations, which come in two forms: one for states and one for state-actions:
$$ \begin{aligned} \eta^\pi_s &= \sum_{a \in A} \int_{S \times \mathbb{R}} \pi_s(a)\, f_{r,\gamma}(\eta^\pi_{s'})\, d\rho(s,a)(s',r), && s \in S, \\ \eta^\pi_{(s,a)} &= \int_{S \times \mathbb{R}} \sum_{a' \in A} \pi_{s'}(a')\, f_{r,\gamma}(\eta^\pi_{(s',a')})\, d\rho(s,a)(s',r), && (s,a) \in S \times A. \end{aligned} \tag{1.2} $$
The distributional Bellman equations are a system of |S| resp. |S×A| one-dimensional distributional equations in R. The right hand side of the equations can be used to introduce the distributional Bellman operator, which for the state return distributions is defined as:
$$ T^\pi: \mathcal{P}(\mathbb{R})^S \longrightarrow \mathcal{P}(\mathbb{R})^S, \qquad \eta = (\eta_s)_{s \in S} \longmapsto T^\pi(\eta) = (T^\pi_s(\eta))_{s \in S}, \quad \text{with } s\text{-th component function} \quad T^\pi_s(\eta) = \sum_{a \in A} \int_{S \times \mathbb{R}} \pi_s(a)\, f_{r,\gamma}(\eta_{s'})\, d\rho(s,a)(s',r). \tag{1.3} $$
No assumption on the MDP ρ nor the policy π is needed to define (1.3), and the relation to the state return distributions is as follows: if the return G, starting from any state s, converges almost surely in R, then the return distribution function ηπ=(ηπs)s∈S is a fixed point of Tπ, that is, ηπ=Tπ(ηπ). We explore this connection in more detail below.
We write log+(x)=log(x) for x≥1 and log+(x)=0 for x<1. The definition of "essential state" is given before Theorem 2. One result of this note is the following:
Theorem 1. The following are equivalent:
(i) Tπ has a fixed point η∈P(R)S,
(ii) G=∑∞t=0γtR(t) converges Pπ[⋅|S(0)=s]-almost surely in R for every initial state s∈S,
(iii) ∫S×Rlog+(|r|)dρ(s,a)(s′,r)<∞ for every pair (s,a)∈S×A that is essential with respect to the law of the Markov chain (S(t),A(t))t∈N0 under Pπ,
(iv) For every ν∈P(R)S, the sequence (Tπ)∘n(ν) of n-th iterates of Tπ converges weakly in the product space P(R)S as n→∞.
If these conditions hold, the fixed point of Tπ is unique, given by the return distribution function ηπ, and (Tπ)∘n(ν)→ηπ weakly as n→∞ for every ν∈P(R)S.
A theorem that reads analogously can be formulated for the distributional Bellman operator Tπ:P(R)S×A→P(R)S×A defined in terms of state-actions. We cover both cases simultaneously by introducing simplified notations in Section 1.2 that allow to analyze the distributional Bellman operator, its fixed point equations and its solutions more conveniently. In Section 2, we present our main results in these notations, Theorem 1 will follow from the more general Theorem 2 presented there. For connections to a model and results in [15] see Remark 2.
Besides giving necessary and sufficient conditions for the existence of solutions to the distributional Bellman equations, we also study properties of their solutions, that is, of the return distribution function ηπ (and also state-action return distributions). We consider tail probability asymptotics ηπs((−∞,−x)) and ηπs((x,∞)) as x→∞. In Theorem 3, we identify cases of exponential decay and existence of p-th moments and in Theorem 4 we cover regular variation. A possible application of the latter is in distributional dynamic programming (see Chapter 5 in [2]), see Remark 6 of the present note.
Building upon our simplified notation, we explore the connection of the distributional Bellman equations to multivariate distributional equations of the form Xd=AX+B in Section 2.1. This connection seems to have not been noticed in the literature so far, and can be used to apply available results in that field to the distributional RL setting; we do that when proving Theorem 4 by applying results presented in [16].
Before we switch to the simplified notation and present the main results, we briefly recall the ordinary Bellman equations for convenience.
The following is well-known and the standard setting in much of the distributional RL literature: if reward distributions ρ(s,a)(S×⋅)∈P(R) have finite expectations for all (s,a), the return converges almost surely and has finite expectations as well; this also follows from Theorems 2 and 3 presented in this note. In many classical approaches to RL, only expected returns are used for policy evaluation. The state values and state-action values are defined by
$$ v^\pi_s = \mathbb{E}^\pi[G \mid S(0)=s] = \int_{\mathbb{R}} g \, d\eta^\pi_s(g), \qquad q^\pi_{(s,a)} = \mathbb{E}^\pi[G \mid S(0)=s, A(0)=a] = \int_{\mathbb{R}} g \, d\eta^\pi_{(s,a)}(g). $$
The collection vπ=(vπs)s∈S is called a state value function and qπ a state-action value function. They can be found as solutions to the (ordinary) Bellman equations, which can be derived from their distributional counterparts (1.2) using linearity of expectation. For action a in state s, let r(s,a)∈R be the expected reward and p(s,a)∈P(S) the distribution of the next state, that is
$$ r(s,a) = \int r \, d\rho(s,a)(s',r) = \mathbb{E}^\pi[R(t) \mid S(t)=s, A(t)=a], \qquad p(s,a) = \rho(s,a)(\cdot \times \mathbb{R}) = \mathbb{P}^\pi[S(t+1) \in \cdot \mid S(t)=s, A(t)=a]. $$
The ordinary Bellman equations are a system of |S| (resp. |S×A|) linear equations in the same number of unknowns, and read as follows:
$$ \begin{aligned} v^\pi_s &= \sum_{a \in A} \pi_s(a) \Big[ r(s,a) + \gamma \sum_{s' \in S} p(s,a)(s')\, v^\pi_{s'} \Big], && s \in S, \\ q^\pi_{(s,a)} &= r(s,a) + \gamma \sum_{s' \in S} p(s,a)(s') \sum_{a' \in A} \pi_{s'}(a')\, q^\pi_{(s',a')}, && (s,a) \in S \times A. \end{aligned} \tag{1.4} $$
As in the distributional setting, one can use the right-hand side of these equations to define the (ordinary) Bellman operators (RS→RS for states) such that (1.4) are equivalent to the associated fixed point equation of these operators.
When judging a policy based on expected returns, π is considered to be at least as good as some other policy ˜π if vπ≥v˜π pointwise at each s. Classical MDP theory shows that there always exists a policy at least as good as every other policy. Being able to calculate expected returns is not just of interest for evaluating π, but also to improve it: the policy π′ defined by π′s=δargmaxa∈Aqπ(s,a) is at least as good as π. This leads to an iterative algorithm known as policy iteration: starting from an initial policy π(0) and letting π(k+1)=(π(k))′ it can be shown that vπ(k) converges as k→∞ to the state value function of an optimal policy.
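For concreteness, the exact policy evaluation and policy iteration steps just described can be sketched in a few lines for a finite MDP; the array layout (P[s,a,s'] for transition probabilities, r[s,a] for expected rewards) is an illustrative choice and not part of the formal setup of this note.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact policy evaluation: solve the linear Bellman equations (1.4),
    v = r_pi + gamma * P_pi v, for the state value function v^pi.
    P[s, a, t]: transition probabilities, r[s, a]: expected rewards, pi[s, a]: policy."""
    P_pi = np.einsum('sa,sat->st', pi, P)          # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, r)            # expected one-step reward under pi
    return np.linalg.solve(np.eye(len(P)) - gamma * P_pi, r_pi)

def policy_iteration(P, r, gamma, max_iter=1000):
    """Greedy policy improvement; v^{pi^(k)} converges to the optimal state values."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)                  # start from the uniform policy
    for _ in range(max_iter):
        v = evaluate_policy(P, r, pi, gamma)
        q = r + gamma * np.einsum('sat,t->sa', P, v)   # state-action values
        new_pi = np.zeros_like(pi)
        new_pi[np.arange(S), q.argmax(axis=1)] = 1.0   # deterministic greedy policy
        if np.allclose(new_pi, pi):
            break
        pi = new_pi
    return pi, v
```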
We introduce simplified notations and define the distributional Bellman operator in this setting. Let d∈N and (R1,J1),…,(Rd,Jd) be random pairs such that each Ri is real-valued and Ji is discrete, taking values in [d]={1,…,d}. For each i∈[d], L(Ri,Ji)∈P(R×[d]) denotes the joint distribution of the random pair (Ri,Ji). The random pairs (R1,J1),…,(Rd,Jd), together with some constant γ∈(0,1), define (a version of) the distributional Bellman operator as follows:
$$ T: \mathcal{P}(\mathbb{R})^d \longrightarrow \mathcal{P}(\mathbb{R})^d, \qquad \eta = \begin{pmatrix} \eta_1 \\ \vdots \\ \eta_d \end{pmatrix} \longmapsto T(\eta) = \begin{pmatrix} T_1(\eta) \\ \vdots \\ T_d(\eta) \end{pmatrix}, $$
the i-th coordinate function Ti:P(R)d→P(R) being
$$ T_i(\eta) = T_i(\eta_1, \ldots, \eta_d) = \mathcal{L}(R_i + \gamma G_{J_i}), \tag{1.5} $$
where on the right hand side L(Gj)=ηj for each j∈[d] and G1,…,Gd,(Ri,Ji) are independent.
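To build intuition for (1.5), a single application of T can be approximated on empirical samples by pushing draws of the input laws through R_i + γ G_{J_i}. A minimal sketch; the sampler sample_R_J is a hypothetical interface standing in for the laws L(R_i, J_i).

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_T(samples, sample_R_J, gamma, n=10_000):
    """One application of the distributional Bellman operator T on empirical
    approximations: samples[j] is a 1-d array of draws approximating eta_j.
    sample_R_J(i, size) must return arrays (R, J) of length `size` with the
    joint law L(R_i, J_i); this is an assumed interface for the sketch."""
    new_samples = []
    for i in range(len(samples)):
        R, J = sample_R_J(i, n)                              # draws of (R_i, J_i)
        G = np.array([rng.choice(samples[j]) for j in J])    # G_{J_i}, drawn independently
        new_samples.append(R + gamma * G)                    # empirical sample of T_i(eta)
    return new_samples
```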
Example 1 (State return). Let ρ be an MDP and π a stationary policy. Let d=|S| and choose an arbitrary enumeration i=s to identify [d] with S. Let i=s, j=s′ and B⊆R be measurable. Consider the joint distribution
$$ \mathbb{P}[R_i \in B, J_i = j] = \mathbb{P}^\pi[R(t) \in B, S(t+1) = s' \mid S(t) = s] = \sum_{a \in A} \pi_s(a)\, \rho(s,a)(\{s'\} \times B). $$
In this case T in (1.5) is the same as Tπ in (1.3).
Example 2 (State-action return). Let ρ be an MDP and π a stationary policy. Let d=|S×A| and choose an arbitrary enumeration i=(s,a) to identify [d] with S×A. Let i=(s,a), j=(s′,a′) and B⊆R be measurable. Consider the joint distribution
$$ \mathbb{P}[R_i \in B, J_i = j] = \mathbb{P}^\pi[R(t) \in B, S(t+1) = s', A(t+1) = a' \mid S(t) = s, A(t) = a] = \rho(s,a)(\{s'\} \times B)\, \pi_{s'}(a'). $$
In this case, the distributions L(Ri,Ji), i∈[d], completely encode both ρ and π and, provided they exist, the state-action return distributions (ηπ(s,a))(s,a)∈S×A form a fixed point of T.
The connection to distributional RL shows that it is of interest to understand the fixed points of T, that is probability distributions η∈P(R)d satisfying
$$ \eta = T(\eta). \tag{1.6} $$
Let G1,…,Gd be real-valued random variables with ηi=L(Gi). Then, (1.6) is equivalent to the distributional Bellman equations, which is a system of d one-dimensional distributional equations
$$ G_i \stackrel{d}{=} R_i + \gamma G_{J_i}, \qquad i \in [d], \tag{1.7} $$
where $\stackrel{d}{=}$ denotes equality in distribution and, in the i-th equation, G1,…,Gd,(Ri,Ji) are independent, i∈[d]. For i,j∈[d] let
$$ p_{ij} = \mathbb{P}[J_i = j] \qquad \text{and} \qquad \mu_{ij} = \mathcal{L}(R_i \mid J_i = j), $$
where in case pij=0 we set μij=δ0, the Dirac measure in 0. It holds P[Ri∈⋅,Ji=j]=pijμij(⋅). Let Fi(x)=P[Gi≤x] be the cumulative distribution function (cdf) of real-valued random variable Gi. A third equivalent formulation of the distributional Bellman equations (1.7) in terms of cdf's is
$$ F_i(\cdot) = \sum_{j=1}^{d} p_{ij} \int_{-\infty}^{\infty} F_j\!\left(\frac{\cdot - r}{\gamma}\right) d\mu_{ij}(r), \qquad i \in [d]. $$
The distributional Bellman equations in terms of cdfs appeared in the literature before, for instance, in [17] or, in a more specific form, in [18].
We study the following aspects regarding solutions to the distributional Bellman equations, i.e., fixed points of T:
● Existence and uniqueness, for which necessary and sufficient conditions are given in Theorem 2; see also Remark 3.
● Tail probability asymptotics, that is, if η is a fixed point of T with i-th component ηi=L(Gi) we study the asymptotic behavior of P[Gi>x] and P[Gi<−x] as x→∞. Depending on properties of the distributions L(R1,J1),…,L(Rd,Jd), we identify cases of
● exponential decay and existence of p-th moments, see Theorem 3.
● regular variation, see Theorem 4.
In Section 2.1, we explain how our findings relate to the area of multivariate distributional fixed point equations of the form
$$ X \stackrel{d}{=} AX + B, \tag{1.8} $$
in which X,B are random d-dimensional vectors, A is a random d×d matrix, and X and (A,B) are independent. Such equations arise in many different applications, such as in the context of perpetuities and random difference equations. In the above notation, the distributional Bellman equations (1.7) are a system of d one-dimensional distributional equations, which is less restrictive than a single distributional equation in Rd such as (1.8), but more complicated than a single one-dimensional distributional equation. However, using Theorem 2, we show how general results from the multivariate setting become applicable to the study of distributional Bellman equations, see Corollary 1.
In case d=1, the random pairs (R1,J1),…,(Rd,Jd) reduce to a single pair (R1,J1) with J1≡1, and the distributional Bellman equations (1.7) reduce to a single distributional equation in R:
$$ G_1 \stackrel{d}{=} \gamma G_1 + R_1, $$
with G1 independent of R1. This is a simple special case of a one-dimensional distributional equation X=dAX+B, also called perpetuity equation, for which the existence and uniqueness of solutions as well as their properties have been studied in great detail, see, e.g., [19,20] or [21]. A comprehensive discussion is given in Chapter 2 in [16].
Let γ∈(0,1) and (R1,J1),…,(Rd,Jd) be the R×[d]-valued random pairs defining the Bellman operator T:P(R)d→P(R)d as in (1.5). Some results can be formulated and proven within a special construction based on random variables
$$ (I_t)_{t \in \mathbb{N}_0}, \qquad (R_{ijt})_{(i,j) \in [d]^2,\, t \in \mathbb{N}_0}, \tag{2.1} $$
in which
● (It)t∈N0 is a [d]-valued homogeneous Markov chain having transition probabilities (pij)(i,j)∈[d]2 and a uniform starting distribution, that is, for any path i0,i1,…,iT∈[d] it holds that
$$ \mathbb{P}[I_0 = i_0, \ldots, I_T = i_T] = d^{-1}\, p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{T-1} i_T}, $$
● (Rijt)(i,j)∈[d]2,t∈N0 forms an array of independent real-valued random variables with
$$ \mathcal{L}(R_{ijt}) = \mathcal{L}(R_i \mid J_i = j) = \mu_{ij}, $$
in particular, for each pair (i,j), the sequence (Rijt)t∈N0 is independent identically distributed (iid) with distribution μij,
● (It)t∈N0 and (Rijt)(i,j)∈[d]2,t∈N0 are independent.
From now on, we omit quantification of indices in array-like quantities once the index-ranges have been introduced.
Using RVs (2.1), an explicit description of the n-th iteration of T defined inductively as T∘n=T∘(n−1)∘T can be given: the i-th component of T∘n is the map T∘ni:P(R)d→P(R) defined by
$$ T^{\circ n}_i(\eta_1, \ldots, \eta_d) = \mathcal{L}\!\left( \sum_{t=0}^{n-1} \gamma^t R_{I_t, I_{t+1}, t} + \gamma^n G_{I_n} \,\Big|\, I_0 = i \right), $$
with L(Gj)=ηj and G1,…,Gd,(Rijt)ijt,(It)t independent.
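As an illustration (not needed for the proofs), the conditional laws appearing in this representation can be approximated by Monte Carlo: truncate the series at a horizon where γ^t is negligible and simulate the chain. The transition matrix p and the sampler sample_mu(i, j) for μ_ij are assumed inputs in this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_return(i, p, sample_mu, gamma, horizon=200):
    """One truncated draw of G = sum_t gamma^t R_{I_t, I_{t+1}, t} given I_0 = i.
    p[i, j]: transition probabilities of (I_t); sample_mu(i, j): one draw from
    mu_ij = L(R_i | J_i = j)  (assumed interface)."""
    g, state = 0.0, i
    for t in range(horizon):
        nxt = rng.choice(len(p), p=p[state])
        g += gamma ** t * sample_mu(state, nxt)
        state = nxt
    return g

# The i-th component of the fixed point can then be approximated by the
# empirical law of many independent draws of sample_return(i, ...).
```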
To introduce some basic definitions from Markov chain theory, let i,j∈[d] be states. We write i→j if there exists t∈N0 with P[It=j|I0=i]>0. State i is essential if for every state j the implication i→j⇒j→i holds. Since the state space [d] is finite, a state i is essential if and only if it is recurrent, that is, P[It=i for some t>0|I0=i]=1, which implies P[It=i for infinitely many t|I0=i]=1.
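For a finite chain, the essential states can be determined mechanically from the reachability relation →, e.g., by a transitive closure of the positivity pattern of the transition matrix; the following small sketch only illustrates the definition above.

```python
import numpy as np

def essential_states(p):
    """Essential states of a finite Markov chain with transition matrix p:
    i is essential iff every j with i -> j also satisfies j -> i."""
    d = len(p)
    reach = (np.asarray(p) > 0) | np.eye(d, dtype=bool)      # reflexive one-step reachability
    for k in range(d):                                       # Floyd-Warshall style closure
        reach |= np.outer(reach[:, k], reach[k, :])
    return [i for i in range(d)
            if all(reach[j, i] for j in range(d) if reach[i, j])]
```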
Theorem 2. The following conditions are equivalent:
(i) T has a (unique) fixed point,
(ii) E[log+|Ri|]<∞ for all essential i,
(iii) The infinite series G=∑∞t=0γtRIt,It+1,t is almost surely (absolutely) convergent,
(iv) T∘ni(ν) converges weakly as n→∞ for each i∈[d] for some (all) ν∈P(R)d.
If one and hence all of (i)–(iv) hold, then the unique fixed point η∈P(R)d of T has i-th component
$$ \eta_i = \mathcal{L}(G \mid I_0 = i) = \mathcal{L}\!\left( \sum_{t=0}^{\infty} \gamma^t R_{I_t, I_{t+1}, t} \,\Big|\, I_0 = i \right). $$
Remark 2. In [15], a model of iterations of random univariate affine linear maps within a Markovian environment is introduced and its stability is studied. This model is similar to our version of the distributional Bellman operator in (1.5). In [15], the Markov chain is allowed to have a countable state space but is required to be ergodic, whereas the present Markov chain in the distributional RL setting has finite state space and no further restrictions. The results in Section 3 of [15] partly imply claims of the present Theorem 2 with similar underlying techniques based on [19,21].
Note that for a real-valued random variable X, the following much stronger properties each imply E[log+|X|]<∞:
1) |X| is bounded: |X|≤K for some constant K≥0,
2) |X| has a finite exponential moment: E[exp(β|X|)]<∞ for some β>0,
3) |X| has a finite p-th moment: E[|X|p]<∞ for some p∈[1,∞).
Of course, we have 1)⇒2)⇒3). Hence, by Theorem 2, a sufficient condition for the existence of a fixed point of T is that every Rj, j∈[d], satisfies one of the latter three properties. Moreover, supposing T has a fixed point η with i-th component ηi=L(Gi), these properties are transferred from Rj, j∈[d], to Gi; the subsequent theorem presents a concise summary of these well-known facts:
Theorem 3. Let i∈[d]. If for all j with i→j
1) |Rj|≤K then |Gi|≤K/(1−γ),
2) E[exp(β|Rj|)]<∞ then E[exp(β|Gi|)]<∞,
3) E[|Rj|p]<∞ then E[|Gi|p]<∞.
The first property transfer is fundamental in MDP theory, while property transfers of the second type are well-established in perpetuity equation theory (cf. Theorem 2.4.1 in [16]). The third is common throughout the Distributional RL literature (cf. Remark 3). For completeness, a short proof of Theorem 3 is presented in Section 3.2.
Remark 3. For p≥1, let Pp(R)⊂P(R) be the subset of p-integrable distributions on R, that is, Pp(R)={μ∈P(R)|∫R|x|pdμ(x)<∞}. Let pri:Rd→R be the i-th projection. The p-th Wasserstein distance on Pp(R) is defined as
$$ d_p(\mu_1, \mu_2) = \inf\left\{ \int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\psi(x,y) \;\Big|\; \psi \in \mathcal{P}(\mathbb{R}^2) \text{ with } \psi \circ \mathrm{pr}_i^{-1} = \mu_i,\ i = 1,2 \right\}^{1/p}. $$
The space (Pp(R),dp) is a complete metric space and so is the product (Pp(R)d,d′p) with d′p(η,ν)=maxi∈[d]dp(ηi,νi). In [1], the following was shown by extending techniques of [22]: if for all j∈[d] it holds that E[|Rj|p]<∞, that is L(Rj)∈Pp(R), then the restriction T|Pp(R)d of T to Pp(R)d maps to Pp(R)d and is a γ-contraction with respect to d′p. Banach's fixed point theorem yields that the operator T|Pp(R)d:Pp(R)d→Pp(R)d has a unique fixed point η∈Pp(R)d. The latter also follows from our results: if E[|Rj|p]<∞ for all j then E[log+|Rj|]<∞ for all j and, by Theorem 2, there exists a unique fixed point η∈P(R)d. Now L(Rj)∈Pp(R) for all j implies η∈Pp(R)d⊂P(R)d by Theorem 3. Note, however, that our Theorem 2 improves upon results of [1] based on the dp metric twofold: First, Theorem 2 has necessary and sufficient conditions for the existence and uniqueness of fixed points of T without moment assumption such as E[|Rj|p]<∞ for some p≥1. Second, the uniqueness of the fixed point is shown in the full space P(R)d, whereas the earlier results show uniqueness only in the subspace Pp(R)d⊂P(R)d.
Remark 4. Theorem 2 shows that, if it exists, the fixed point η of T can be obtained as the weak limit of the iterates T∘n(ν) as n→∞. This was observed in the distributional RL literature before, see Proposition 4.34 in Chapter 4 of [2], and the proof of this part of the theorem shares many similarities with their arguments.
Next, we focus on asymptotic behavior of tail probabilities P[X>x] and P[X<−x] as x→∞. Each of the three properties stated in Theorem 3 above directly yields asymptotic bounds, in the latter two cases by applying Markov's inequality:
1) if |X|≤K then P[X>x]=P[X<−x]=0 for all x>K,
2) if E[exp(β|X|)]<∞ for some β>0 then
$$ \limsup_{x \to \infty} \frac{\log \mathbb{P}[X > x]}{x} \le -\beta \qquad \text{and} \qquad \limsup_{x \to \infty} \frac{\log \mathbb{P}[X < -x]}{x} \le -\beta, $$
3) if E[|X|p]<∞ for some p∈[1,∞) then P[X>x],P[X<−x] are O(x−p) as x→∞.
With this at hand, in each of the three cases presented in Theorem 3, one can obtain asymptotic bounds for the tail probabilities P[Gi>x] and P[Gi<−x] as x→∞.
Besides the three classical properties discussed so far, a further interesting and important property of (the distribution of) a random variable X implying E[log+|X|]<∞ is that of regular variation. Regularly varying distributions represent a broad and adaptable class of heavy-tailed distributions, containing several important distribution families as specific cases, see Example 3. As mentioned in the introduction, previous applied works have investigated scenarios that involve heavy-tailed reward distributions. Consequently, an in-depth understanding of how heavy-tailed reward behavior is reflected in the returns is of interest, a question that Theorem 4 answers in the case of regular variation.
Regular variation is directly defined in terms of the asymptotic behavior of the tail probabilities P[X>x] and P[X<−x] as x→∞. First, a function f:(0,∞)→[0,∞) with f(x)>0 for all x>x0 is called regularly varying with index β∈R if it is measurable and
$$ \lim_{x \to \infty} \frac{f(tx)}{f(x)} = t^\beta \qquad \text{for every } t > 0. $$
A function L:(0,∞)→[0,∞) is called slowly varying if it is regularly varying with index β=0, so L(tx)/L(x)→1 for every t>0. Every regularly varying function f with index β is of the form f(x)=xβL(x) for some slowly varying function L, see [23].
Following [16], a real-valued random variable X is called regularly varying with index α>0 if
a) the function x↦P[|X|>x] is regularly varying with index −α and
b) left and right tail probabilities are asymptotically balanced: for some q∈[0,1]
$$ \lim_{x \to \infty} \frac{\mathbb{P}[X > x]}{\mathbb{P}[|X| > x]} = q \qquad \text{and} \qquad \lim_{x \to \infty} \frac{\mathbb{P}[X < -x]}{\mathbb{P}[|X| > x]} = 1 - q. $$
An equivalent definition reads as follows: a real-valued random variable X is regularly varying with index α>0 if and only if there exists a slowly varying function L and some q∈[0,1] such that
$$ \lim_{x \to \infty} \frac{\mathbb{P}[X > x]}{x^{-\alpha} L(x)} = q \qquad \text{and} \qquad \lim_{x \to \infty} \frac{\mathbb{P}[X < -x]}{x^{-\alpha} L(x)} = 1 - q. \tag{2.2} $$
In this situation, we have P[|X|>x]=P[X>x]+P[X<−x]∼x−αL(x) as x→∞, so P[|X|>x] is regularly varying of index −α, and the tail probabilities are asymptotically balanced.
Similar to the other three properties, regular variation transfers from rewards to returns in the following sense:
Theorem 4. Let i∈[d], α>0 and L a slowly varying function. Suppose for all j with i→j, there exist qj∈[0,1] and cj∈[0,∞), with cj>0 for at least one j, such that
$$ \lim_{x \to \infty} \frac{\mathbb{P}[R_j > x]}{x^{-\alpha} L(x)} = q_j c_j \qquad \text{and} \qquad \lim_{x \to \infty} \frac{\mathbb{P}[R_j < -x]}{x^{-\alpha} L(x)} = (1 - q_j) c_j. $$
Then with
$$ w_{ij} = \sum_{t=0}^{\infty} (1 - \gamma^\alpha)\, \gamma^{\alpha t}\, \mathbb{P}[I_t = j \mid I_0 = i], $$
that is wij=P[IN=j|I0=i] with N∼Geo(1−γα) independent of (It)t, it holds that
$$ \lim_{x \to \infty} \frac{\mathbb{P}[G_i > x]}{x^{-\alpha} L(x)} = \frac{\sum_{j=1}^{d} w_{ij} q_j c_j}{1 - \gamma^\alpha} \qquad \text{and} \qquad \lim_{x \to \infty} \frac{\mathbb{P}[G_i < -x]}{x^{-\alpha} L(x)} = \frac{\sum_{j=1}^{d} w_{ij} (1 - q_j) c_j}{1 - \gamma^\alpha}. \tag{2.3} $$
In particular, Gi is regularly varying with index α.
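The weights w_ij depend only on the transition matrix (p_ij) of the chain (I_t): since the geometric series ∑_t (γ^α P)^t equals (I − γ^α P)^{−1}, they, and with them the limits in (2.3), can be computed in closed form. A small sketch of this computation:

```python
import numpy as np

def theorem4_tail_constants(p, gamma, alpha, q, c):
    """Weights w[i, j] = (1 - gamma^alpha) * [(I - gamma^alpha P)^(-1)]_{ij} and the
    resulting right/left tail constants of G_i from (2.3)."""
    p, q, c = map(np.asarray, (p, q, c))
    g = gamma ** alpha
    w = (1 - g) * np.linalg.inv(np.eye(len(p)) - g * p)
    right = w @ (q * c) / (1 - g)           # limit of P[G_i > x] / (x^{-alpha} L(x))
    left = w @ ((1 - q) * c) / (1 - g)      # limit of P[G_i < -x] / (x^{-alpha} L(x))
    return w, right, left
```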
Remark 5. For d=1 Theorem 4 is equivalent to Theorem 2.4.3 (2) in [16]. For d=1, the tail asymptotic of P[|G1|>x] as x→∞ could also be directly obtained from [24], Theorem 2.3.
Example 3. A regularly varying random variable R is called Pareto-like if P[R>x]∼qcx−α and P[R<−x]∼(1−q)cx−α for some constants c>0, q∈[0,1], α>0, that is, if in (2.2) one can choose L to be constant. Examples of Pareto-like distributions include Pareto, Cauchy, Burr, and α-stable distributions, α<2, see [7]. If there exists a non-empty subset J⊆[d] and α>0 such that for each j∈J the variable Rj is Pareto-like with index α and for each j∉J we have P[|Rj|>x]=o(x−α), then Theorem 4 applies by choosing L≡1 and yields that Gi is also Pareto-like.
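As a purely numerical illustration of Example 3 (not part of the formal results), one can simulate truncated returns for a single-state model (d = 1) with Pareto rewards and estimate the tail index of G, for instance with the Hill estimator; the estimate should come out roughly equal to α, reflecting that the Pareto-like tail is inherited by the return.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_pareto(alpha, size):
    """Draws with P[R > x] = x^(-alpha) for x >= 1 (Pareto-like, q = 1, c = 1)."""
    return rng.uniform(size=size) ** (-1.0 / alpha)

def hill_alpha(x, k):
    """Hill estimator of the tail index from the k largest observations."""
    top = np.sort(x)[-(k + 1):]
    return 1.0 / np.mean(np.log(top[1:] / top[0]))

alpha, gamma, horizon, n = 1.5, 0.8, 100, 50_000
G = (gamma ** np.arange(horizon)) @ sample_pareto(alpha, (horizon, n))  # truncated returns
print(hill_alpha(G, k=250))   # roughly alpha = 1.5 (Hill estimates carry bias and noise)
```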
Remark 6 (Application in distributional dynamic programming). The goal in distributional dynamic programming is to perform distributional policy evaluation on a computer. A major problem to solve is that probability distributions ηs∈P(R) are infinite dimensional objects, and hence some form of approximation has to be applied in practice. Several approaches are discussed in Chapter 5 of [2]. Additional issues arise in cases of Pareto-like tail behaviors with unbounded support as in Example 3: it is then known that (some of) the true state-value distributions are Pareto-like, which may be desirable information to include in evaluating the policy. Theorem 4 justifies modeling the tails of the state-value distributions a priori as Pareto-like distributions; the correct choice of asymptotic parameters is given by (2.3). We plan to report on such issues in future work.
A prevalent method for examining probabilistic behaviors of systems involves coupling techniques, wherein multiple dependent versions of a system of interest are constructed and their collective behavior is analyzed. In this section, we adapt this approach to our specific context, demonstrating how it allows for the application of results from the well-established theory of multivariate distributional fixed point equations in the study of distributional Bellman equations.
A first key insight is that the operator T:P(R)d→P(R)d, defined in (1.5), depends on the random pairs (R1,J1),…,(Rd,Jd) only through their marginal laws
$$ (\mathcal{L}(R_1, J_1), \ldots, \mathcal{L}(R_d, J_d)) \in \mathcal{P}(\mathbb{R} \times [d])^d, $$
recall the definition of T=(T1,…,Td) being
$$ T: \mathcal{P}(\mathbb{R})^d \longrightarrow \mathcal{P}(\mathbb{R})^d, \qquad T(\mathcal{L}(G_1), \ldots, \mathcal{L}(G_d)) = (\mathcal{L}(R_1 + \gamma G_{J_1}), \ldots, \mathcal{L}(R_d + \gamma G_{J_d})), $$
where in the i-th component the random variables G1,…,Gd,(Ri,Ji) are independent. However, the second key insight is that the independence of G1,…,Gd is not necessary for the definition: any random vector (˜G1,…,˜Gd) having the same marginal laws as (G1,…,Gd) and being independent of (Ri,Ji) satisfies L(Ri+γ˜GJi)=L(Ri+γGJi).
By a coupling of (R1,J1),…,(Rd,Jd) we understand any way the random pairs could be defined on a common probability space, thus having a joint distribution
$$ \mathcal{L}((R_1, J_1), \ldots, (R_d, J_d)) \in \mathcal{P}((\mathbb{R} \times [d])^d). \tag{2.4} $$
A coupling (2.4) induces an operator S on P(Rd) given by
$$ S: \mathcal{P}(\mathbb{R}^d) \longrightarrow \mathcal{P}(\mathbb{R}^d), \qquad S(\mathcal{L}(G_1, \ldots, G_d)) = \mathcal{L}(R_1 + \gamma G_{J_1}, \ldots, R_d + \gamma G_{J_d}), $$
where on the right hand side (G1,…,Gd) and ((R1,J1),…,(Rd,Jd)) are independent. The two key insights mentioned above yield that S and T are related by
$$ S(\zeta) \circ \mathrm{pr}_i^{-1} = T_i(\zeta \circ \mathrm{pr}_1^{-1}, \ldots, \zeta \circ \mathrm{pr}_d^{-1}), \qquad i \in [d],\ \zeta \in \mathcal{P}(\mathbb{R}^d), \tag{2.5} $$
with pri:Rd→R being the i-th coordinate projection and hence ζ∘pr−1i∈P(R) the i-th marginal law of ζ∈P(Rd). An immediate consequence of (2.5) of interest to us is
$$ \mathcal{L}(G_1, \ldots, G_d) = \zeta \in \mathcal{P}(\mathbb{R}^d) \text{ is a fixed point of } S \;\Longrightarrow\; (\mathcal{L}(G_1), \ldots, \mathcal{L}(G_d)) = (\zeta \circ \mathrm{pr}_1^{-1}, \ldots, \zeta \circ \mathrm{pr}_d^{-1}) \in \mathcal{P}(\mathbb{R})^d \text{ is a fixed point of } T. \tag{2.6} $$
Corollary 1 below implies the following converse of this implication: if T has a (unique) fixed point (L(G1),…,L(Gd)) then for any coupling (2.4) the induced operator S has a unique fixed point which, due to (2.6), takes the form of a coupling L(G1,…,Gd) of G1,…,Gd. This justifies the following approach to study (fixed points of) T: choose a convenient coupling (2.4) and study its induced operator S, in particular the marginal laws of a fixed point of S. We use this approach proving Theorem 4, where it is most convenient to consider (R1,J1),…,(Rd,Jd) to be independent.
One advantage of this coupling approach is that it enables access to numerous results from the field of multivariate fixed point equations, as we explain in the following.
Let S be induced by a coupling (2.4). Define the random d-dimensional vector
$$ R = (R_1, \ldots, R_d)^{\mathsf{T}} $$
and the random d×d matrix
$$ J = (J_{ij})_{(i,j) \in [d] \times [d]} \qquad \text{with} \qquad J_{ij} = \gamma \cdot \mathbf{1}(J_i = j). $$
The coupling is then completely encoded in the joint distribution L(J,R)∈P(Rd×d×Rd). Moreover, for any random vector G=(G1,…,Gd)T it holds
$$ JG + R = (R_1 + \gamma G_{J_1}, \ldots, R_d + \gamma G_{J_d})^{\mathsf{T}} $$
and hence, if G is independent of (J,R), it holds that
$$ S(\mathcal{L}(G)) = \mathcal{L}(JG + R) $$
and L(G)∈P(Rd) is a fixed point of S if and only if
$$ G \stackrel{d}{=} JG + R. \tag{2.7} $$
Note that (2.7) is a single distributional equation in Rd, whereas the distributional Bellman equations, see (1.7), are a system of d distributional equations in R. Multivariate fixed point equations like (2.7) have long been studied for general joint laws L(J,R)∈P(Rd×d×Rd) under various types of assumptions, some of them directly applicable to the situation presented here. A comprehensive overview of the theory of multivariate distributional fixed point equations and its applications, in particular to stochastic difference equations and many important stochastic time-series models, is presented in [16].
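To make this concrete, the independence coupling used later in the proof of Theorem 4 can be sampled directly, and the forward recursion G ← JG + R then produces approximate draws from the fixed point of S (whose marginals approximate the fixed point of T). A minimal sketch; sample_mu(i, j), a sampler for μ_ij, is an assumed interface.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_JR(p, sample_mu, gamma):
    """One draw of (J, R) under the independence coupling: J_i ~ p[i, :] and,
    given J_i = j, R_i ~ mu_ij, independently across i."""
    d = len(p)
    Jidx = np.array([rng.choice(d, p=p[i]) for i in range(d)])
    J = gamma * np.eye(d)[Jidx]                      # J[i, j] = gamma * 1(J_i = j)
    R = np.array([sample_mu(i, Jidx[i]) for i in range(d)])
    return J, R

def forward_draw(p, sample_mu, gamma, horizon=200):
    """One approximate draw from the fixed point of S via G <- J G + R."""
    G = np.zeros(len(p))
    for _ in range(horizon):                         # gamma^horizon is negligible
        J, R = sample_JR(p, sample_mu, gamma)        # fresh independent copy of the coupling
        G = J @ G + R
    return G
```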
Theorem 2 can be used to show the following results, in which (J(t),R(t))t∈N0 are iid copies of (J,R).
Corollary 1. For any operator S induced by a coupling (2.4), the following statements are equivalent
(i) S has a (unique) fixed point,
(ii) S∘n(ζ) converges weakly as n→∞ for some (all) ζ∈P(Rd),
(iii) The infinite series G=∑∞t=0[∏t−1s=0J(s)]R(t) is almost surely (absolutely) convergent,
(iv) T has a (unique) fixed point,
(v) E[log+|Ri|]<∞ for all essential i.
If one, and hence all, of these hold, then the unique fixed point ζ∈P(Rd) of S is given by the law of the infinite vector-valued series L(G) in (iii) and the unique fixed point of T is the vector of marginal laws of ζ.
Before we present a short proof, we explain how this result fits the known theory of multivariate fixed points equations. Let |⋅| be the euclidean norm on Rd. We also write |⋅| for the induced operator norm on Rd×d. A result by [25] shows that for any joint law L(J,R)∈P(Rd×d×Rd) such that E[log+|J|],E[log+|R|]<∞ and such that (the distribution of) J has a strictly negative top Lyapunov exponent, that is
$$ \inf_{t \ge 1} \frac{1}{t}\, \mathbb{E}\Big[\log \Big| \prod_{s=0}^{t-1} J(s) \Big|\Big] < 0, $$
the sum in (iii) converges almost surely and its law is the unique solution to (2.7). We refer to Appendix E in [16] for more information about top Lyapunov exponents. In our case, having Jij=γ1(Ji=j), and hence |∏t−1s=0J(s)|=γt, the top Lyapunov exponent of J equals log(γ)<0, so it is strictly negative, and |J|=γ implies E[log+|J|]=0<∞. Thus, the implication (v)⇒(iii) in Corollary 1 says that the condition E[log+|R|]<∞, which is equivalent to E[log+|Ri|]<∞ for all i∈[d], can be relaxed to E[log+|Ri|]<∞ for all essential i to obtain almost sure convergence of the infinite series in (iii) in our case.
The implication (ii)⇒(iii) can directly be obtained from Theorem 2.1 in [26] since |∏t−1s=0J(s)|=γt→0 (almost surely) in our case. To deduce almost sure convergence from weak convergence in Theorem 2 (implication (ii)⇒(iii) there), we present Proposition 1 in Section 3, the proof of which is based on ideas presented in [27] to prove a version of Kolmogorov's three-series theorem also involving distributional convergence.
Besides the top Lyapunov exponent, another important notion in multivariate fixed point equations is that of irreducibility: the joint law L(J,R)∈P(Rd×d×Rd) is called irreducible if the only affine linear subspace H⊆Rd that fulfills P[JH+R⊆H]=1 is the complete subspace H=Rd. In case (J,R) is irreducible, the implication (i)⇒(iii) can be obtained directly from Theorem 2.4 in [28]. In our situation, irreducibility of L(J,R) is not guaranteed, and Corollary 1 shows that it is also not needed to obtain the implication (i)⇒(iii).
Proof of Corollary 1. (iv)⟺(v), see Theorem 2.
(iv)⟹(iii). By Theorem 2, the infinite series ∑∞t=0γtRIt,It+1,t is almost surely (absolutely) convergent. For each i∈[d], the i-th component of the vector-valued infinite series ∑∞t=0[∏t−1s=0J(s)]R(t) has the same law as L(∑∞t=0γtRIt,It+1,t|I0=i), hence every component converges almost surely (absolutely).
(iii)⟹(ii). Let ζ=L(G0) with G0 independent of (J(t),R(t))t. The n-th iteration of S can be represented as
$$ S^{\circ n}(\zeta) = \mathcal{L}\!\left( \sum_{t=0}^{n-1} \Big[ \prod_{s=0}^{t-1} J(s) \Big] R(t) + \Big[ \prod_{s=0}^{n-1} J(s) \Big] G_0 \right). $$
Now [∏n−1s=0J(s)]G0→0 almost surely and ∑n−1t=0[∏t−1s=0J(s)]R(t) converges almost surely by assumption. So S∘n(ζ) converges in distribution, namely to the law of the infinite sum in (iii).
(ii)⟹(i). Assume S∘n(ζ)→ζ0 converges weakly for some ζ. The map S:P(Rd)→P(Rd) is continuous with respect to the topology of weak convergence, hence S(ζ0)=limnS(S∘n(ζ))=limnS∘(n+1)(ζ)=ζ0, so ζ0 is a fixed point of S.
(i)⟹(iv). See (2.5).
Let m∈N and γ∈(0,1) be fixed. Consider random variables
$$ (K_t)_{t \in \mathbb{N}_0}, \qquad (X_{lt})_{l \in [m],\, t \in \mathbb{N}_0} \tag{3.1} $$
such that (Kt)t is an [m]={1,…,m}-valued homogeneous Markov chain with an arbitrary starting distribution, (Xlt)lt is an array of independent real-valued random variables such that for each l∈[m] the variables (Xlt)t are iid, and (Xlt)lt and (Kt)t are independent. A state l is called accessible if P[Kt=l]>0 for some t∈N0. Note that we allow any starting distribution in our formulation, hence this is not automatically satisfied. A state l is accessible essential if and only if P[Kt=l for infinitely many t∈N0]>0, and this probability equals 1 if conditioned on the event of at least one visit to l. In light of Theorem 2, which we prove in the next subsection, any result shown within the setting (3.1) can be applied to (2.1) by considering
$$ [m] \simeq [d]^2, \qquad l = (i,j), \qquad X_{lt} = R_{ijt}, \qquad K_t = (I_t, I_{t+1}) $$
and in this case working under the condition P[⋅|I0=i] is equivalent to consider the starting distribution P[K0=(i′,j)]=1(i′=i)pij.
The proof of Theorem 2 is based on the following proposition, which shares similarities with Kolmogorov's three-series theorem as presented in Theorem 5.18 of [27].
Proposition 1. For n∈N let Sn=∑n−1t=0γtXKt,t. Then the following are equivalent
(i) Sn converges in distribution as n→∞.
(ii) Sn converges almost surely as n→∞.
(iii) E[log+|Xlt|]<∞ for each accessible essential state l∈[m].
Proof. With probability one, every finite state space Markov chain eventually takes values in one of its irreducible components; hence, to show (iii)⇔(ii), we can reduce to the case that (Kt)t is irreducible; in particular, every state is accessible essential in this case.
(iii)⇒(ii): We assume E[log+|Xlt|]<∞ for all l. Let 0<c<log(1/γ) and ˜c=log(1/γ)−c>0. For each t∈N0, let
At={γt|XKt,t|≥exp(−ct)}={log+|XKt,t|≥˜ct}. |
We show ∑∞t=0P[At]<∞ and apply the Borel–Cantelli lemma. Conditioning on Kt=l and bounding P[Kt=l]≤1 yields
∞∑t=0P[At]=P[A0]+∞∑t=1P[log+|XKt,t|˜c≥t]≤P[A0]+m∑l=1∞∑t=1P[log+|Xl,t|˜c≥t]≤P[A0]+m∑l=1E[log+|Xl,t|˜c]<∞.
The Borel–Cantelli lemma implies that Act={γt|XKt,t|<exp(−ct)} occurs for all but finitely many t almost surely, and hence, almost surely ∑∞t=0γt|XKt,t|<∞, since the tail of the infinite sum is a.s. finally dominated by that of the geometric series ∑texp(−ct)<∞. Hence, Sn is absolutely convergent and thus is convergent almost surely.
(ii)⇒(iii): We assume there is an accessible essential state l such that E[log+|Xlt|]=∞ and show that Sn diverges almost surely. We closely follow the idea in the proof of [16], Theorem 2.1.3 (also [19], proof of Lemma 1.7) and apply the root test, which states that for any real-valued sequence (at)t∈N0
lim supt∈N0|at|1/t>1⟹∞∑t=0atdiverges, |
where "diverges" means not having a finite limit. Let ˜C>1, C=˜C/γ>1 and
Bt={|γtXKt,t|1/t>˜C}={|XKt,t|1/t>C}={log+|XKt,t|>tlog(C)}. |
If we can show P[lim supt∈N0Bt]=1, the root test applies and yields the almost sure divergence of Sn and hence not (ii). We want to apply the second Borel–Cantelli lemma, but since the events Bt,t∈N0 are not independent, we have to take a little higher effort. For l∈[m],j∈N, let Tlj be the time of the j-th visit of (Kt)t in l with Tlj=∞ in case no such visit occurs. With this, it holds that
lim supt∈N0Bt=m⋃l=1lim supj∈N{log+|Xl,Tlj|>Tljlog(C)}. |
For calculating the probability of the right-hand side, we can replace Xl,Tlj with Xlj since (Xlt)lt and (Tlj)lj are independent and (Xlt)t is iid for each l. Therefore, by defining the event
Clj={log+|Xlj|>Tljlog(C)} |
we have
P[lim supt∈N0Bt]=P[m⋃l′=1lim supj∈NCl′j]≥P[lim supj∈NClj],
where on the right hand side the state l is such that E[log+|Xlt|]=∞ which exists by assumption. We show that the right-hand side is one. Using independence of (Xlt)t and (Tlj)j Fubini's theorem implies
P[lim supj∈NClj]=∫P[lim supj∈N{log+|Xlj|>tljlog(C)}]dP(Tlj)j((tlj)j), | (3.2) |
where the integral on the right hand side is with respect to the distribution of (Tlj)j under P. Since (Kt)t is assumed to be an irreducible Markov chain it holds that 1jTlj→cl>0 almost surely as j→∞ (where cl is the inverse of the unique stationary distribution probability for state l). Let (tlj)lj be a realization of (Tlj)lj with tlj∼jcl as j→∞. If we can show that for any such realization it holds that ∑∞j=1P[log+|Xlj|>tljlog(C)]=∞, we can apply the second Borel–Cantelli lemma to conclude P[lim supj∈N{log+|Xlj|>tljlog(C)}]=1. Formula (3.2) and the fact that almost every realization (tlj)j of (Tlj)j satisfies tlj∼jcl, then would yield P[lim supj∈NCj]=1 as desired.
Since tlj∼jcl for every ε>0 there is some j0 such that tlj<(1+ε)jcl for every j>j0 which allows to conclude
∞∑j=1P[log+|Xlj|>(1+ε)jcllog(C)]=∞⟹∞∑j=1P[log+|Xlj|>tljlog(C)]=∞ |
With C′=(1+ε)cllog(C)>0, we have
∞∑j=1P[log+|Xlj|>(1+ε)jcllog(C)]≥E[⌊log+|Xlj|C′⌋]=∞, |
since by assumption E[log+|Xlj|]=∞.
(ii)⇒(i). Obvious.
(i)⇒(iii). We show that if there is some accessible essential state l with E[log+|Xl,t|]=∞, then Sn does not converge in distribution. For that, we use two technical Lemmas from [27] which were used to prove a version of Kolmogorov's three series theorem. Let (X′lt)lt be an array of independent RVs, independent of (Xlt)lt,(Kt)t and distributed as (Xlt)lt. Let ˜Xlt=Xlt−X′lt. In [27] ˜Xlt is called symmetrization of Xlt. Let S′n=∑n−1t=0γtX′Kt,t and ˜Sn=Sn−S′n=∑n−1t=0γt˜XKt,t. Note that ˜Sn is not a symmetrization of Sn, because Sn,S′n are not independent as they both depend on (Kt)t. But for any fixed realization (kt)t we have ∑n−1t=0γt˜Xkt,t a symmetrization of ∑n−1t=0γtXkt,t, which is enough for our reasoning.
From E[log+|Xl,t|]=∞, one can conclude that E[log+|˜Xl,t|]=∞, and hence the already established equivalence (ii)⟺(iii) shows that ˜Sn does not converge almost surely, in fact we saw that ˜Sn diverges almost surely. Thus, using independence of (Kt)t and (˜Xkt)kt, Fubini's theorem implies
1=P[˜Sndiverges as n→∞]=∫P[∑n−1t=0γt˜Xkt,tdiverges as n→∞]dP(Kt)t((kt)t), |
hence for almost every realization (kt)t of (Kt)t the sequence ∑n−1t=0γt˜Xkt,t diverges almost surely. Theorem 5.17 in [27] about sums of independent symmetric random variables applies and shows that |∑n−1t=0γt˜Xkt,t|→∞ in probability for almost every (kt)t. Hence, for almost every realization (kt)t and any constant C>0 it holds that
limn→∞P[|n−1∑t=0γt˜Xkt,t|>C]=1. |
Since ∑n−1t=0γt˜Xkt,t is a symmetrization of ∑n−1t=0γtXkt,t, Lemma 5.19 in [27] applies and yields the following bound
P[|n−1∑t=0γt˜Xkt,t|>C]≤2P[|n−1∑t=0γtXkt,t|>C]. |
Hence, for almost every realization (kt)t of (Kt)t it holds that
lim infn→∞P[|n−1∑t=0γtXkt,t|>C]≥12. |
Fatou's lemma together with Fubini's theorem yields
lim infn→∞P[|Sn|>C]≥12. |
This holds for every C>0 and hence the sequence L(Sn),n∈N is not tight which shows that it does not converge weakly.
Proof of Theorem 2. Using the special construction let
Sn=n−1∑t=0γtRIt,It+1,t. |
For any ν∈P(R)d with i-th component νi=L(Wi), let W1,…,Wd,(Rijt)ijt,(It)t be independent. Then the representation
T∘ni(ν)=L(Sn+γnWIn|I0=i) |
holds. (i)⟺(iv). Let η∈P(R)d be a fixed point of T, hence with i-th component ηi=L(Gi) it holds that
ηi=T∘ni(η)=L(Sn+γnGIn|I0=i). |
Since γnGIn→0 almost surely, this implies that conditioned on I0=i the sequence Sn converges to ηi in distribution. In particular, if a fixed point exists, its i-th component has to be the distributional limit of Sn under I0=i, and thus, fixed points are unique. If a fixed point exists, then T∘n(ν) converges to the unique fixed point η for any ν, since γnWIn→0 almost surely holds for any random variables W1,…,Wd. Now suppose that T∘n(ν)→η holds for some ν. The map T is continuous in the product topology on P(R)d and hence T(η)=T(limnT∘n(ν))=limnT(T∘n(ν))=limnT∘(n+1)(ν)=η. So a fixed point exists, which finishes the equivalence (i)⟺(iv).
(iv)⟹(iii). Supposing (iv) yields that conditioned on I0=i the sequence Sn converges in distribution. Applying Proposition 1 by considering l=(i,j) and Xlt=Rijt and Kt=(It,It+1) with P[K0=(i′,j)]=1(i′=i)pij yields the almost sure convergence. Since it holds conditioned for every i, it also holds unconditioned, hence (iii).
(iii)⟹(iv). Almost sure convergence also holds conditioned on I0=i; one can now apply Proposition 1.
(iii)⟺(ii). Again, we apply the equivalence of (ii) and (iii) from Proposition 1. We only note that in the situation l=(i,j) and Kt=(It,It+1) a state l=(i,j) is essential with respect to (Kt)t iff i is essential with respect to (It)t and pij>0. Also any state l with pij>0 is accessible when starting in I0=i. Hence, one obtains that the infinite sum (iii) in Theorem 2 converges almost surely iff E[log+|Rijt|]<∞ for each essential pair (i,j), that is for each essential i and j with pij>0. Since we defined μij=δ0 for pij=0 this is equivalent to E[log+|Ri|]<∞ for every essential i.
Proposition 2. Let β>0. If E[exp(β|Xlt|)]<∞ for each accessible state l then the infinite sum S=∑∞t=0γtXKt,t exists almost surely and it holds that E[exp(β|S|)]<∞.
Proof. The infinite sum exists almost surely due to Proposition 1, as the existence of exponential moments for each accessible state clearly yields the existence of the logarithmic moment. Using the triangle inequality, that x↦exp(βx) is continuous non-decreasing and monotone convergence of expectations yields
E[exp(β|S|)]≤limn→∞E[exp(βn∑t=0γt|XKt,t|)]. |
Conditioning on (K0,…,Kn) and using independence of (Kt)t and (Xlt)lt yields
E[exp(βn∑t=0γt|XKtt|)]≤E[n∏t=0E[exp(β|Xlt|)γt]|l=Kt]. |
For each t∈N0, we have γt∈(0,1), and hence x↦x(γt) is a concave function on [0,∞). For each accessible state l we have
E[exp(β|Xlt|)γt]≤E[exp(β|Xlt|)]γt<∞. |
Let c(β)=max{E[exp(β|Xl|)]|l∈[m] accessible}. We have c(β)<∞. One can bound the right-hand side of the previous display by
n∏t=0c(β)γt=c(β)n∑t=0γt. |
Putting all together yields
E[exp(β|S|)]≤limn→∞c(β)n∑t=0γt=c(β)1/(1−γ)<∞. |
Proof of Theorem 3. Consider Gi=∑tγtRIt,It+1,t starting with I0=i. With probability one (It,It+1) realizes to a pair (j,j′) with i→j and pjj′>0.
1) We have P[|Rj|≤K]=∑j′pjj′P[|Rjj′t|≤K]. Hence, if P[|Rj|≤K]=1 for all j with i→j, then P[|Rjj′t|≤K]=1 for all (j,j′) with i→j and pjj′>0. This yields P[|RIt,It+1,t|≤K|I0=i]=1 for every t and hence |Gi|≤∑tγt|RIt,It+1,t|≤K/(1−γ) almost surely.
2) Due to Theorem 2, we can consider the situation of Proposition 2 via l=(i,j), Xlt=Rijt and Kt=(It,It+1) having starting distribution P[K0=(i,j)]=pij. In this situation, a state l=(i′,j) is accessible with respect to (Kt)t if and only if i→i′ and pi′j>0. So, Gi has the same distribution as the almost sure limit ∑∞t=0γtXKt,t and one obtains E[exp(β|Gi|)]<∞.
3) We have E[|Rj|p]=∑j′pjj′E[|Rjj′t|p] hence E[|Rj|p]<∞ for all j with i→j implies E[|Rjj′t|p]<∞ for all (j,j′) with i→j and pjj′>0. Let ||Rjj′t||p=E[|Rjj′t|p]1/p and K=max(j,j′)||Rjj′t||p where the maximum runs over all relevant pairs (j,j′), hence K<∞. Let the Markov chain (It)t be started in state I0=i. Then Gi,T=∑Tt=0γtRIt,It+1,t→Gi almost surely as T→∞ and the process (It,It+1)t only visits pairs (j,j′) with ||Rjj′t||p≤K<∞. Since ||⋅||p is a norm, for T′>T
||Gi,T′−Gi,T||p=‖T′∑t=T+1γtRIt,It+1,t‖p≤T′∑t=T+1||γtRIt,It+1,t||p≤K⋅T′∑t=T+1γt→0asmin{T,T′}→∞, |
hence (Gi,T)T∈N is a Cauchy sequence with respect to ||⋅||p. The associated Lp-space is complete, hence there is a random variable ˜Gi in Lp, that is ||˜Gi||p<∞, such that ||Gi,T−˜Gi||p→0. This mode of convergence also implies Gi,T→˜Gi in probability. The almost sure convergence Gi,T→Gi implies that Gi,T→Gi in probability. Limits with respect to convergence in probability are almost surely unique, hence ˜Gi=Gi with probability one and ||Gi||p=||˜Gi||p<∞.
Our second main theorem is the result about regular variation of fixed points of T, see Theorem 4. As noted before, our proof makes use of the well-developed theory for multivariate fixed point equations
$$ G \stackrel{d}{=} JG + R, \qquad G \text{ and } (J,R) \text{ independent.} \tag{3.4} $$
In particular, a notion of regular variation for multivariate distributions, extending the one-dimensional case, has been explored in the context of equations such as (3.4) under various assumptions on the joint law L(J,R)∈P(Rd×d×Rd). Below, we use the coupling approach explained in Section 2.1 to apply some of the available results in proving Theorem 4.
First, we introduce the multivariate notion of regular variation in random vectors and state some facts we use in the proof. We refer the reader to Appendix C in [16] for more details on the basic properties of multivariate regular variation. Let ˉR=R∪{−∞,∞} be the extended real numbers and let ˉRd0=ˉRd∖{0}. A Rd-valued random vector X is called regularly varying if there exists a non-null Radon measure μ on ˉRd0, that does not charge infinite points, such that
$$ \lim_{x \to \infty} \frac{\mathbb{P}[x^{-1} X \in C]}{\mathbb{P}[|X| > x]} = \mu(C) \qquad \text{for every measurable } \mu\text{-continuity set } C \subseteq \bar{\mathbb{R}}^d_0, \tag{3.5} $$
note that |⋅| denotes the euclidean norm on Rd. The measure μ is called the limit measure of regular variation of the random vector X. The following are well-known facts about regularly varying random vectors, again we refer to Appendix C in [16]
● If μ is the limit measure of regular variation of a random vector X, then there is a unique α>0, called the index of regular variation, such that μ(tC)=t−αμ(C) for every μ-continuity set C and t>0.
● If (3.5) is satisfied with index of regular variation α>0, then the set C={x∈Rd||x|>1} is a μ-continuity set, and hence, because {x−1X∈C}={|X|>x}, it holds μ(C)=1 and
x↦P[|X|>x] |
is a regularly varying function with index −α. Further, the set {x∈Rd|wTx>1} is a μ-continuity set for each w∈Rd,
● X is regularly varying with index α>0 if and only if there exists a distribution Ξ∈P(Sd−1) on the unit sphere Sd−1={x∈Rd||x|=1} such that for every t>0
P[|X|>tx,X|X|∈⋅]P[|X|>x]w→t−αΞ(⋅)as x→∞, |
where w→ is weak convergence of finite measures.
The following theorem can be obtained easily as a special case of Theorem 4.4.24 in [16]; note that the empty product of (random) matrices is the identity matrix.
Theorem 5. Let (R1,J1),…,(Rd,Jd) be a coupling, R=(R1,…,Rd)T and J=(Jij)ij with Jij=γ1(Ji=j), see Section 2.1. Suppose R is regularly varying with index α>0 and limit measure μR. Then the multivariate distributional fixed point equation (3.4) has a unique solution G which is regularly varying with index α. In particular, with (J(s))s∈N0 being iid copies of J, it holds
$$ \lim_{x \to \infty} \frac{\mathbb{P}[x^{-1} G \in C]}{\mathbb{P}[|R| > x]} = \sum_{t=0}^{\infty} \mathbb{E}\Big[ \mu_R\Big(\Big\{ x \in \mathbb{R}^d \;\Big|\; \Big(\prod_{s=0}^{t-1} J(s)\Big) x \in C \Big\}\Big) \Big] \tag{3.6} $$
for every μR-continuity set C and
$$ \mathbb{P}[|G| > x] \sim c \cdot \mathbb{P}[|R| > x] \qquad \text{as } x \to \infty, \tag{3.7} $$
where c>0 equals the right hand side of (3.6) evaluated at C={x∈Rd||x|>1}.
Proof. Since C={x∈Rd||x|>1} is a μ-continuity set that satisfies P[|G|>x]=P[x−1|G|∈C] the asymptotic expression (3.7) follows from (3.6). The remaining part of the theorem is a direct consequence of Theorem 4.4.24 from [16] by noticing that the additional necessary assumptions stated in that theorem, E[|J|α]<1 and E[|J|α+δ]<∞ for some α>0 and δ>0, are satisfied in our situation of interest as Jij=γ1(Ji=j) and hence |J|=γ<1.
In Theorem 5, an arbitrary coupling (R1,J1),…,(Rd,Jd) is considered and the (multivariate) regular variation of R=(R1,…,Rd)T, which is an assumption there, transfers over to the (multivariate) regular variation of G=(G1,…,Gd)T, whose marginal laws are the solutions to the distributional Bellman equations of interest in Theorem 4. Moreover, multivariate regular variation of G can be used to show univariate regular variation of G1,…,Gd by testing (3.5) on sets of the form C={x∈Rd|±eTjx>1} with ej∈Rd being unit vectors. Note that Theorem 4 concerns solutions of the distributional Bellman operator T. Hence, the result of that theorem only depends on the marginal laws (L(R1,J1),…,L(Rd,Jd)) and hence Theorem 5, which assumes presence of a coupling satisfying certain properties, is not directly applicable there. However, as explained in Section 2.1, results about coupled situations can become useful by choosing a convenient coupling L((R1,J1),…,(Rd,Jd)) that makes certain arguments work out. In the following proof, this will be the case when an independence coupling is considered: independence of R1,…,Rd together with the assumptions of Theorem 4 implies that the random vector R=(R1,…,Rd)T is regularly varying in the multivariate sense, so Theorem 5 becomes applicable. We now present technical details:
Proof of Theorem 4. Let (R1,J1),…,(Rd,Jd) be independent. Let ej∈Rd be the j-th unit vector and
R=(R1,…,Rd)T=d∑j=1Rj⋅ej, |
that is, we express R as a sum of independent random vectors. In the following, we rely on well-established results from the theory of (multivariate) regular variation which show that regular variation is transferred to sums of independent random variables (vectors) each having this property.
First, we determine the asymptotic behavior of P[|R|>x] as x→∞: let i∈[d] be fixed. Only those pairs (Rj,Jj) with i→j are relevant for the distribution of Gi, hence we assume w.l.o.g. that i→j holds for all j, that is for every j∈[d] there exists constants qj∈[0,1] and cj∈[0,∞) such that
limx→∞P[Rj>x]x−αL(x)=qjcjandlimx→∞P[Rj<−x]x−αL(x)=(1−qj)cj, | (3.8) |
where α>0 and L is a slowly varying function. This implies for every j
cj=limx→∞P[|Rj|>x]x−αL(x)=limx→∞P[R2j>x2]x−αL(x)=limy→∞P[R2j>y]y−α/2˜L(y)with˜L(y)=L(√y). |
The function ˜L is slowly varying which shows that R2j is regularly varying with index α/2 in case cj>0 and P[R2j>x]=o(x−α/2) as x→∞ in case cj=0. In such a situation, and since R21,…,R2d are independent by choice of coupling, standard results from the theory of regular variation apply, see Section 1.3.1 in [29], and show
P[R21+⋯+R2d>y]∼P[R21>y]+⋯+P[R2d>y]asy→∞ |
and hence, by re-substituting x2 for y,
P[R21+⋯+R2d>x2]∼P[R21>x2]+⋯+P[R2d>x2]∼(c1+⋯+cd)x−αL(x)asx→∞, |
that is
limx→∞P[|R|>x]x−αL(x)=c1+⋯+cd. | (3.9) |
Next, we give an argument to show that R is regularly varying with index α (in the multivariate sense): let sgn(Rj)∈{−1,1} be the sign of Rj, so that |Rjej|=|Rj| and Rjej/|Rjej|=sgn(Rj)ej in case Rj≠0. If cj>0, then Rjej is regularly varying with index α>0 in the multivariate sense; the spectral measure Ξj∈P(Sd−1) is given by
Ξj=qjδej+(1−qj)δ−ej=limx→∞P[sgn(Rj)ej∈⋅||Rj|>x]. |
Applying standard arguments, see Lemma C.3.1 in [16], it is easily seen that R=∑dj=1Rjej (a sum of independent random vectors) is regularly varying at index α. Let μR be the associated measure. We are now able to apply Theorem 5 to the solution G=(G1,…,Gd)T of the distributional equation (3.4).
First, we evaluate the limit measure μR at some sets of interest: Combining (3.8) and (3.9) yields for every j∈[d]
μR({x∈Rd|eTjx>1})=limx→∞P[Rj>x]P[|R|>x]=qjcjc1+⋯+cd. |
As in Theorem 5, let (J(s))s∈N0 be a sequence of iid copies of J. For every t∈N0 the entry of the random matrix ∏t−1s=0J(s) at position (i,j)∈[d]2 equals γt⋅1(Iit=j), where (Iit)i∈[d],t∈N0 are [d]-valued random variables such that for every i∈[d] the sequence (Iit)t has the same law as (It)t under P[⋅|I0=i].
By Theorem 5, for every $i\in[d]$ it holds that
$$\lim_{x\to\infty}\frac{P[G_i>x]}{P[|R|>x]} = \sum_{t=0}^{\infty} E\Big[\mu_R\Big(\Big\{x\in\mathbb{R}^d \,\Big|\, e_i^T\Big(\prod_{s=0}^{t-1}J^{(s)}\Big)x>1\Big\}\Big)\Big].$$
Using $\mu_R(\gamma^{-t}C)=\gamma^{\alpha t}\mu_R(C)$ for every $t\in\mathbb{N}_0$, it holds that
$$\begin{aligned}
E\Big[\mu_R\Big(\Big\{x\in\mathbb{R}^d \,\Big|\, e_i^T\Big(\prod_{s=0}^{t-1}J^{(s)}\Big)x>1\Big\}\Big)\Big]
&= E\Big[\mu_R\big(\{x\in\mathbb{R}^d \mid \gamma^t e_{I^i_t}^T x>1\}\big)\Big] \\
&= \gamma^{\alpha t}\sum_{j\in[d]} P[I_t=j\mid I_0=i]\;\mu_R\big(\{x\in\mathbb{R}^d \mid e_j^T x>1\}\big) \\
&= \gamma^{\alpha t}\sum_{j\in[d]} P[I_t=j\mid I_0=i]\;\frac{q_j c_j}{c_1+\cdots+c_d}
\end{aligned}$$
and hence
$$\lim_{x\to\infty}\frac{P[G_i>x]}{P[|R|>x]} = \sum_{t=0}^{\infty}\gamma^{\alpha t}\sum_{j\in[d]}P[I_t=j\mid I_0=i]\;\frac{q_j c_j}{c_1+\cdots+c_d} = \frac{1}{1-\gamma^{\alpha}}\sum_{j\in[d]}P[I_N=j\mid I_0=i]\;\frac{q_j c_j}{c_1+\cdots+c_d},$$
where $N\sim\mathrm{Geo}(\gamma^{\alpha})$ is independent of $(I_t)_t$. With $w_{ij}=P[I_N=j\mid I_0=i]$ and (3.9) we have
$$\lim_{x\to\infty}\frac{P[G_i>x]}{x^{-\alpha}L(x)} = \frac{\sum_{j\in[d]}w_{ij}\,q_j c_j}{1-\gamma^{\alpha}}.$$
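The weights $w_{ij}$ and the resulting tail constant are easy to evaluate numerically. The following Python sketch uses hypothetical values of $\gamma$, $\alpha$, a transition matrix of $(I_t)$ and constants $q_j,c_j$ (none of them from the paper); it computes $w_{ij}=P[I_N=j\mid I_0=i]$ by summing the matrix geometric series $(1-\gamma^{\alpha})\sum_t \gamma^{\alpha t}P^t=(1-\gamma^{\alpha})(I-\gamma^{\alpha}P)^{-1}$ and then evaluates the limit in the display above, together with the analogous left-tail limit derived next.

```python
import numpy as np

# Sketch with illustrative values (gamma, alpha, P, q, c are hypothetical, not from the
# paper): compute w_ij = P[I_N = j | I_0 = i] for N ~ Geo(gamma^alpha), i.e.
# P[N = t] = (1 - gamma^alpha) gamma^(alpha t) as read off from the derivation above, by
# summing the matrix geometric series (1 - r) sum_t r^t P^t = (1 - r)(I - r P)^{-1} with
# r = gamma^alpha, and then evaluate the right- and left-tail constants of G_i.
gamma, alpha = 0.9, 1.5
P = np.array([[0.5, 0.5, 0.0],          # hypothetical transition matrix of (I_t)
              [0.2, 0.3, 0.5],
              [0.3, 0.3, 0.4]])
q = np.array([0.7, 0.5, 0.2])           # illustrative right-tail weights q_j from (3.8)
c = np.array([1.0, 0.5, 2.0])           # illustrative tail constants c_j from (3.8)

r = gamma ** alpha
W = (1.0 - r) * np.linalg.inv(np.eye(len(P)) - r * P)   # rows of W sum to one

right_tail = W @ (q * c) / (1.0 - r)    # i-th entry: sum_j w_ij q_j c_j / (1 - gamma^alpha)
left_tail = W @ ((1.0 - q) * c) / (1.0 - r)
print("w_ij =\n", W.round(4))
print("lim P[G_i >  x] / (x^-alpha L(x)):", right_tail.round(4))
print("lim P[G_i < -x] / (x^-alpha L(x)):", left_tail.round(4))
```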
The tail asymptotic of $P[G_i<-x]/(x^{-\alpha}L(x))$ can be calculated in the same way using
$$\mu_R\big(\{x\in\mathbb{R}^d \mid e_j^T x<-1\}\big) = \lim_{x\to\infty}\frac{P[R_j<-x]}{P[|R|>x]} = \frac{(1-q_j)\,c_j}{c_1+\cdots+c_d}.$$
To see that $G_i$ is regularly varying with index $\alpha$, note that $P[|G_i|>x]=P[G_i>x]+P[G_i<-x]\sim \frac{\sum_{j=1}^d w_{ij}c_j}{1-\gamma^{\alpha}}\,x^{-\alpha}L(x)$ is a regularly varying function of index $\alpha$ and the balance equations hold:
$$\lim_{x\to\infty}\frac{P[G_i>x]}{P[|G_i|>x]} = \frac{\sum_{j=1}^d q_j w_{ij} c_j}{\sum_{j=1}^d w_{ij} c_j} \qquad\text{and}\qquad \lim_{x\to\infty}\frac{P[G_i<-x]}{P[|G_i|>x]} = \frac{\sum_{j=1}^d (1-q_j) w_{ij} c_j}{\sum_{j=1}^d w_{ij} c_j} = 1 - \frac{\sum_{j=1}^d q_j w_{ij} c_j}{\sum_{j=1}^d w_{ij} c_j}.$$
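The following Monte Carlo sketch illustrates the balance equations. It assumes the series representation $G_i=\sum_{t\ge 0}\gamma^t R^{(t)}_{I^i_t}$ suggested by the matrix products above, with $R^{(0)},R^{(1)},\dots$ independent copies of $R$ under the independence coupling; the truncation level, transition matrix and all constants are illustrative choices, not taken from the paper.

```python
import numpy as np

# Monte Carlo sketch under stated assumptions (not the paper's construction): under the
# independence coupling, the proof's series structure suggests G_i = sum_t gamma^t R^(t)_{I_t},
# with (I_t) the chain started in i and R^(0), R^(1), ... independent copies of R.
# We truncate this series and compare the empirical right/left tail balance of G_i at a
# large threshold with the first balance equation above.
rng = np.random.default_rng(1)

gamma, alpha = 0.9, 1.5
P = np.array([[0.5, 0.5, 0.0],          # hypothetical transition matrix of (I_t)
              [0.2, 0.3, 0.5],
              [0.3, 0.3, 0.4]])
q = np.array([0.7, 0.5, 0.2])           # illustrative right-tail weights q_j
c = np.array([1.0, 0.5, 2.0])           # illustrative tail constants c_j
d, n, T = len(c), 200_000, 60           # gamma^T is negligible at this truncation level
cumP = np.cumsum(P, axis=1)

i0, x = 0, 100.0
state = np.full(n, i0)
G = np.zeros(n)
for t in range(T):
    u = rng.random((n, d))
    R_t = np.where(rng.random((n, d)) < q, 1.0, -1.0) * (c / u) ** (1.0 / alpha)
    G += gamma ** t * R_t[np.arange(n), state]
    # step the Markov chain: new state = number of cumulative probabilities below u
    state = np.minimum((rng.random(n)[:, None] > cumP[state]).sum(axis=1), d - 1)

pos, neg = np.mean(G > x), np.mean(G < -x)
r = gamma ** alpha
W = (1.0 - r) * np.linalg.inv(np.eye(d) - r * P)        # w_ij as in the proof
theory = (W[i0] @ (q * c)) / (W[i0] @ c)                # predicted right-tail balance for i = i0
print(f"empirical balance: {pos / (pos + neg):.3f},  predicted: {theory:.3f}")
```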
The authors have not used Artificial Intelligence (AI) tools in the creation of this article.
We thank three anonymous referees for their careful reading, many constructive remarks and suggestions. The referees' comments led to a significant improvement of the manuscript.
The first author was partially supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 502386356.
The authors declare there are no conflicts of interest.
[1] | M. G. Bellemare, W. Dabney, R. Munos, A distributional perspective on reinforcement learning, in International Conference on Machine Learning, (2017), 449–458. |
[2] | M. G. Bellemare, W. Dabney, M. Rowland, Distributional Reinforcement Learning, MIT Press, 2023. |
[3] | M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, 1994. https://doi.org/10.1002/9780470316887 |
[4] | M. Rowland, M. Bellemare, W. Dabney, R. Munos, Y. W. Teh, An analysis of categorical distributional reinforcement learning, in International Conference on Artificial Intelligence and Statistics, (2018), 29–37. |
[5] | E. Krasheninnikova, J. García, R. Maestre, F. Fernández, Reinforcement learning for pricing strategy optimization in the insurance industry, Eng. Appl. Artif. Intell., 80 (2019), 8–19. https://doi.org/10.1016/j.engappai.2019.01.010 |
[6] | P. N. Kolm, G. Ritter, Modern perspectives on reinforcement learning in finance, J. Mach. Learn. Finance, 1 (2019). https://doi.org/10.2139/ssrn.3449401 |
[7] | P. Embrechts, C. Klüppelberg, T. Mikosch, Modelling Extremal Events: For Insurance and Finance, Springer-Verlag, Berlin, 1997. https://doi.org/10.1007/978-3-642-33483-2 |
[8] | V. Zhuang, Y. Sui, No-regret reinforcement learning with heavy-tailed rewards, in Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, (2021), 3385–3393. |
[9] | A. M. Medina, S. Yang, No-regret algorithms for heavy-tailed linear bandits, in Proceedings of The 33rd International Conference on Machine Learning (eds. M. F. Balcan and K. Q. Weinberger), PMLR, (2016), 1642–1650. https://proceedings.mlr.press/v48/medina16.html |
[10] | X. Yu, H. Shao, M. R. Lyu, I. King, Pure exploration of multi-armed bandits with heavy-tailed payoffs, in Conference on Uncertainty in Artificial Intelligence, (2018), 937–946. |
[11] | H. Shao, X. Yu, I. King, M. R. Lyu, Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs, in Advances in Neural Information Processing Systems (eds. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett), Curran Associates, Inc., 2018. |
[12] | S. Lu, G. Wang, Y. Hu, L. Zhang, Optimal algorithms for lipschitz bandits with heavy-tailed rewards, in International Conference on Machine Learning, PMLR, (2019), 4154–4163. |
[13] | S. R. Chowdhury, A. Gopalan, Bayesian optimization under heavy-tailed payoffs, in Advances in Neural Information Processing Systems (eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett), Curran Associates, Inc., 2019. |
[14] | A. Dubey, A. Pentland, Thompson sampling on symmetric alpha-stable bandits, in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, (2019), 5715–5721. https://doi.org/10.24963/ijcai.2019/792 |
[15] | G. Alsmeyer, F. Buckmann, Stability of perpetuities in Markovian environment, J. Differ. Equations Appl., 23 (2017), 699–740. https://doi.org/10.1080/10236198.2016.1271878 |
[16] | D. Buraczewski, E. Damek, T. Mikosch, Stochastic Models with Power-law Tails, Springer, 2016. https://doi.org/10.1007/978-3-319-29679-1 |
[17] | T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, T. Tanaka, Nonparametric return distribution approximation for reinforcement learning, in Proceedings of the 27th International Conference on International Conference on Machine Learning, (2010), 799–806. |
[18] | K. J. Chung, M. J. Sobel, Discounted MDP's: Distribution functions and exponential utility maximization, SIAM J. Control Optim., 25 (1987), 49–62. https://doi.org/10.1137/0325004 |
[19] | W. Vervaat, On a stochastic difference equation and a representation of nonnegative infinitely divisible random variables, Adv. Appl. Probab., 11 (1979), 750–783. https://doi.org/10.2307/1426858 |
[20] | C. M. Goldie, R. Grübel, Perpetuities with thin tails, Adv. Appl. Probab., 28 (1996), 463–480. https://doi.org/10.2307/1428067 |
[21] | C. M. Goldie, R. A. Maller, Stability of perpetuities, Ann. Probab., 28 (2000), 1195–1218. https://doi.org/10.1214/aop/1019160331 |
[22] | U. Rösler, A fixed point theorem for distributions, Stochastic Process. Appl., 42 (1992), 195–214. https://doi.org/10.1016/0304-4149(92)90035-O |
[23] | N. H. Bingham, C. M. Goldie, J. L. Teugels, Regular Variation, Cambridge University Press, Cambridge, 1987. https://doi.org/10.1017/CBO9780511721434 |
[24] | D. Cline, Infinite series of random variables with regularly varying tails, Inst. Appl. Math. Statist., (1983). |
[25] | A. Brandt, The stochastic equation $Y_{n+1}=A_nY_n+B_n$ with stationary coefficients, Adv. Appl. Probab., 18 (1986), 211–220. https://doi.org/10.2307/1427243 |
[26] | T. Erhardsson, Conditions for convergence of random coefficient AR(1) processes and perpetuities in higher dimensions, Bernoulli, 20 (2014), 990–1005. https://doi.org/10.3150/13-BEJ513 |
[27] | O. Kallenberg, Foundations of Modern Probability, Springer, 2021. https://doi.org/10.1007/978-3-030-61871-1 |
[28] | P. Bougerol, N. Picard, Strict stationarity of generalized autoregressive processes, Ann. Probab., 20 (1992), 1714–1730. |
[29] | T. Mikosch, Regular Variation, Subexponentiality and Their Applications in Probability Theory, Eindhoven University of Technology, Eindhoven, The Netherlands, 1999. |