
Multiscale regression on unknown manifolds

  • We consider the regression problem of estimating functions on $\mathbb{R}^D$ but supported on a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$ with $d \ll D$. Drawing ideas from multi-resolution analysis and nonlinear approximation, we construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales, and perform multiscale regression by local polynomial fitting. We propose a data-driven wavelet thresholding scheme that automatically adapts to the unknown regularity of the function, allowing for efficient estimation of functions exhibiting nonuniform regularity at different locations and scales. We analyze the generalization error of our method by proving finite sample bounds in high probability on rich classes of priors. Our estimator attains optimal learning rates (up to logarithmic factors) as if the function were defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. The implemented algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$. Our work therefore establishes a new framework for regression on low-dimensional sets embedded in high dimensions, with fast implementation and strong theoretical guarantees.

    Citation: Wenjing Liao, Mauro Maggioni, Stefano Vigogna. Multiscale regression on unknown manifolds[J]. Mathematics in Engineering, 2022, 4(4): 1-25. doi: 10.3934/mine.2022028




    High-dimensional data challenge classical statistical models and require new understanding of tradeoffs in accuracy and efficiency. The seemingly quantitative fact of the increase of dimension has qualitative consequences in both methodology and implementation, demanding new ways to break what has been called the curse of dimensionality. On the other hand, the presence of inherent nonuniform structure in the data calls into question linear dimension reduction techniques, and motivates a search for intrinsic learning models. In this paper we explore the idea of learning and exploiting the intrinsic geometry and regularity of the data in the context of regression analysis. Our goal is to build low-dimensional representations of high dimensional functions, while ensuring good generalization properties and fast implementation. In view of the complexity of the data, we allow interesting features to change from scale to scale and from location to location. Hence, we will develop multiscale methods, extending classical ideas of multi-resolution analysis beyond regular domains and to the random sample regime.

    In regression, the problem is to estimate a function from a finite set of random samples. The minimax mean squared error (MSE) for estimating functions in the Hölder space $C^s([0,1]^D)$, $s>0$, is $O(n^{-2s/(2s+D)})$, where $n$ is the number of samples. The exponential dependence of the minimax rate on $D$ manifests the curse of dimensionality in statistical learning, as $n = O(\varepsilon^{-(2s+D)/s})$ points are generally needed to achieve accuracy $\varepsilon$. This rate is optimal (in the minimax sense), unless further structural assumptions are made [28,32]. If the samples concentrate near a $d$-dimensional set with $d \ll D$, and the function belongs to a nonuniform smoothness space $\mathcal{B}_S$ with $S > s$, we may hope to find estimators converging at rate $O(n^{-2S/(2S+d)})$. In this quantified sense, we may break the curse of dimensionality by adapting to the intrinsic dimension and regularity of the problem.
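    To make this comparison concrete, here is a small numerical sketch of the sample sizes suggested by the two rates above; the values of $s$, $d$, $D$ and $\varepsilon$ are hypothetical and chosen only for illustration.

```python
# Rough sample-size comparison implied by the minimax rates above.
# The values of s, d, D, eps are hypothetical, chosen only for illustration.
s, d, D, eps = 2.0, 3, 100, 0.1

# n ~ eps^{-(2s+dim)/s} samples are needed for accuracy eps (up to constants).
n_ambient = eps ** (-(2 * s + D) / s)    # rate governed by the ambient dimension D
n_intrinsic = eps ** (-(2 * s + d) / s)  # rate governed by the intrinsic dimension d

print(f"ambient-dimension rate:   n ~ {n_ambient:.3e}")
print(f"intrinsic-dimension rate: n ~ {n_intrinsic:.3e}")
```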

    A possible approach to this problem is based on first performing dimension reduction, and then regression in the reduced space. Linear dimension reduction methods include principal component analysis (PCA) [24,25,39], for data concentrating on a single subspace, or subspace clustering [8,9,18,36,47], for a union of subspaces. Going beyond linear models, we encounter isomap [43], locally linear embedding [40], local tangent space alignment [49], Laplacian eigenmaps [2], Hessian eigenmaps [15] and diffusion maps [12]. Besides the classical Principal Component Regression [26], in [33] diffusion maps are used for nonparametric regression by expanding the unknown function over the eigenfunctions of a kernel-based operator. It is proved that, when data lie on a $d$-dimensional manifold, the MSE converges at rate $O(n^{-1/O(d^2)})$. This rate depends only on the intrinsic dimension, but does not match the minimax rate in the Euclidean space. If infinitely many unlabeled points are sampled, so that the eigenfunctions are exactly computed, the MSE can achieve optimal rates for Sobolev functions with smoothness parameter at least 1. Similar results hold for regression with Laplacian eigenmaps [50].

    Some regression methods have been shown to automatically adapt to the intrinsic dimension and perform as well as if the intrinsic domain was known. Results in this direction have been established for local linear regression [4], k-nearest neighbors [29], and kernel regression [31], where optimal rates depending on the intrinsic dimension were proved for functions in $C^2$, $C^1$, and $C^s$ with $s \le 1$, respectively. Kernel methods such as kernel ridge regression are also known to adapt to the intrinsic dimension [41,48], while suitable variants of regression trees have been proved to attain intrinsic yet suboptimal learning rates [30]. On the other hand, dyadic partitioning estimates with piecewise polynomial regression can cover the whole scale of spaces $C^s$, $s > 0$ [21], and be combined with wavelet thresholding techniques to optimally adapt to broader classes of nonuniform regularity [5,6]. However, such estimators are cursed by the ambient dimension $D$, due to the exponential cardinality of a dyadic partition of the $D$-dimensional hypercube.

    This paper aims at generalizing dyadic partitioning estimates [5,6] to predict functions supported on low-dimensional sets, with optimal performance guarantees and low computational cost. We tie together ideas in classical statistical learning [20,21,45], multi-resolution analysis [12,13,38], and nonlinear approximation [11,16,17]. Our main tool is geometric multi-resolution analysis (GMRA) [1,34,35,37], which is a multiscale geometric approximation scheme for point clouds in high dimensions concentrating near low-dimensional sets. Using GMRA we learn low-dimensional local coordinates at multiple scales, on which we perform a multiscale regression estimate by fitting local polynomials. Inspired by wavelet thresholding techniques [5,6,11], we then compute differences between estimators at adjacent scales, and retain the locations where such differences are large enough. This empirically reveals where higher resolution is required to attain a good approximation, generating a data-driven partition which adapts to the local regularity of the function.

    Our approach has several distinctive features:

    (i) it is multiscale, and is therefore well-suited for data sets containing variable structural information at different scales;

    (ii) it is adaptive, allowing the function to have localized singularities or variable regularity;

    (iii) it is entirely data-driven, that is, it does not require a priori knowledge about the regularity of the function, and rather learns it automatically from the data;

    (iv) it is provable, with strong theoretical guarantees of optimal performance on large classes of priors;

    (v) it is efficient, having straightforward implementation, minor parameter tuning, and low computational cost.

    We will prove that, for functions supported on a $d$-dimensional manifold and belonging to a rich model class characterized by a smoothness parameter $S$, the MSE of our estimator converges at rate $O((\log n/n)^{2S/(2S+d)})$. This model class contains classical Hölder continuous functions, but further accounts for potential nonuniform regularity. Our results show that, up to a logarithmic factor, we attain the same optimal learning rate as if the function were defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. In particular, the rate of convergence depends on the intrinsic dimension $d$ and not on the ambient dimension $D$. In terms of computations, all the constructions above can be realized by algorithms of complexity $O(n\log n)$, with constants linear in the ambient dimension $D$ and exponential in the intrinsic dimension $d$.

    The remainder of this paper is organized as follows. We conclude this section by defining some general notation and formalizing the problem setup. In Section 2 we review geometric multi-resolution analysis. In Section 3 we introduce our multiscale regression method, establish the performance guarantees, and discuss the computational complexity of our algorithms. The proofs of our results are collected in Section 4, with some technical details postponed to Appendix A.

    Notation. $f \lesssim g$ and $f \gtrsim g$ mean that there exists a positive constant $C$, independent of any variable upon which $f$ and $g$ depend, such that $f \le Cg$ and $f \ge Cg$, respectively. $f \asymp g$ means that both $f \lesssim g$ and $f \gtrsim g$ hold. The cardinality of a set $A$ is denoted by $\#A$. For $x \in \mathbb{R}^D$, $\|x\|$ denotes the Euclidean norm and $B_r(x)$ denotes the Euclidean ball of radius $r$ centered at $x$. Given a subspace $V \subseteq \mathbb{R}^D$, we denote its dimension by $\dim(V)$ and the orthogonal projection onto $V$ by $\mathrm{Proj}_V$. Let $f,g : \mathcal{M} \to \mathbb{R}$ be two functions, and let $\rho$ be a probability measure supported on $\mathcal{M}$. We define the inner product of $f$ and $g$ with respect to $\rho$ as $\langle f,g\rangle := \int_{\mathcal{M}} f(x)g(x)\,d\rho$. The $L^2$ norm of $f$ with respect to $\rho$ is $\|f\| := \big(\int_{\mathcal{M}} |f(x)|^2\,d\rho\big)^{1/2}$. Given $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ of $\rho$, the empirical $L^2$ norm of $f$ is $\|f\|_n := \big(\frac{1}{n}\sum_{i=1}^n |f(x_i)|^2\big)^{1/2}$. The $L^\infty$ norm of $f$ is $\|f\|_\infty := \operatorname{ess\,sup}|f|$. We denote probability and expectation by $\mathbb{P}$ and $\mathbb{E}$, respectively. For a fixed $M > 0$, $T_M$ is the truncation operator defined by $T_M(x) := \min(|x|,M)\operatorname{sign}(x)$. We denote by $\mathbf{1}_{j,k}$ the indicator function of an indexed set $C_{j,k}$ (i.e., $\mathbf{1}_{j,k}(x) = 1$ if $x \in C_{j,k}$, and $0$ otherwise).

    Setup. We consider the problem of estimating a function $f : \mathcal{M} \to \mathbb{R}$ given $n$ samples $\{(x_i,y_i)\}_{i=1}^n$, where

    ● $\mathcal{M}$ is an unknown Riemannian manifold of dimension $d$ isometrically embedded in $\mathbb{R}^D$, with $d \ll D$;

    ● $\rho$ is an unknown probability measure supported on $\mathcal{M}$;

    ● $\{x_i\}_{i=1}^n$ are independently drawn from $\rho$;

    ● $y_i = f(x_i) + \zeta_i$;

    ● $f$ is bounded, with $\|f\|_\infty \le M$;

    ● $\{\zeta_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables with sub-Gaussian norm $\|\zeta_i\|_{\psi_2} \le \sigma^2$, independent of the $x_i$'s.

    We wish to construct an estimator $\hat f$ of $f$ minimizing the mean squared error

    $$\mathrm{MSE} := \mathbb{E}\|f - \hat f\|^2 = \mathbb{E}\int_{\mathcal{M}} |f(x) - \hat f(x)|^2\,d\rho.$$
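    As a minimal illustration of this setup (not the construction used in the paper), the following sketch generates synthetic data matching the assumptions above: points sampled on a $1$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, with noisy labels. The circle, the regression function and all numerical values are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the setup: a 1-dimensional manifold (a circle)
# isometrically embedded in R^D, with Gaussian (hence sub-Gaussian) noise.
D, n, sigma = 20, 2000, 0.1

# A random orthonormal frame Q embeds the circle from R^2 into R^D isometrically.
Q, _ = np.linalg.qr(rng.standard_normal((D, 2)))

theta = rng.uniform(0, 2 * np.pi, size=n)                 # x_i ~ rho (uniform on the circle)
X = np.column_stack([np.cos(theta), np.sin(theta)]) @ Q.T  # samples x_i in R^D

f = lambda t: np.sin(3 * t)                                # an illustrative bounded regression function
y = f(theta) + sigma * rng.standard_normal(n)              # y_i = f(x_i) + zeta_i
```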

    Geometric multi-resolution analysis (GMRA) is an efficient tool to build low-dimensional representations of data concentrating on or near a low-dimensional set embedded in high dimensions. To keep the presentation self-contained, we summarize here the main ideas, and refer the reader to [1,34,37] for further details. Given a probability measure $\rho$ supported on a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$, GMRA performs the following steps:

    (1). Construct a multiscale tree decomposition $\mathcal{T}$ of $\mathcal{M}$ into nested cells $\mathcal{T} := \{C_{j,k}\}_{k\in\mathcal{K}_j,\,j\in\mathbb{Z}}$, where $j$ represents the scale and $k$ the location. Here $\mathcal{K}_j$ is a location index set.

    (2). Compute a local principal component analysis on each $C_{j,k}$. Let $c_{j,k}$ be the mean of $x$ on $C_{j,k}$, and $V_{j,k}$ the $d$-dimensional principal subspace of $C_{j,k}$. Define $P_{j,k}(x) := c_{j,k} + \mathrm{Proj}_{V_{j,k}}(x - c_{j,k})$.

    An ideal multiscale tree decomposition should satisfy assumptions (A1)–(A5) below for all integers $j \ge j_{\min}$:

    (A1) For every $k \in \mathcal{K}_j$ and $k' \in \mathcal{K}_{j+1}$, either $C_{j+1,k'} \subseteq C_{j,k}$ or $\rho(C_{j+1,k'} \cap C_{j,k}) = 0$. The children of $C_{j,k}$ are the cells $C_{j+1,k'}$ such that $C_{j+1,k'} \subseteq C_{j,k}$. We assume that $1 \le a_{\min} \le \#\{C_{j+1,k'} : C_{j+1,k'} \subseteq C_{j,k}\} \le a_{\max}$ for all $k \in \mathcal{K}_j$ and $j \ge j_{\min}$. Also, for every $C_{j,k}$, there exists a unique $k' \in \mathcal{K}_{j-1}$ such that $C_{j,k} \subseteq C_{j-1,k'}$. We call $C_{j-1,k'}$ the parent of $C_{j,k}$.

    (A2) $\rho\big(\mathcal{M}\setminus\bigcup_{k\in\mathcal{K}_j} C_{j,k}\big) = 0$, i.e., $\Lambda_j := \{C_{j,k}\}_{k\in\mathcal{K}_j}$ is a partition of $\mathcal{M}$, up to negligible sets.

    (A3) There exists $\theta_1 > 0$ such that $\#\Lambda_j \le 2^{jd}/\theta_1$.

    (A4) There exists $\theta_2 > 0$ such that, if $x$ is drawn from $\rho$ conditioned on $C_{j,k}$, then $\|x - c_{j,k}\| \le \theta_2 2^{-j}$ almost surely.

    (A5) Let $\lambda_1^{j,k} \ge \lambda_2^{j,k} \ge \cdots \ge \lambda_D^{j,k}$ be the eigenvalues of the covariance matrix $\Sigma_{j,k}$ of $\rho|_{C_{j,k}}$, defined in Table 1. Then:

    Table 1.  Objects of GMRA. $V_{j,k}$ and $\hat V_{j,k}$ are the eigenspaces associated with the largest $d$ eigenvalues of $\Sigma_{j,k}$ and $\hat\Sigma_{j,k}$, respectively.
    oracle: $\rho(C_{j,k})$ | empirical counterpart: $\hat\rho(C_{j,k}) := \hat n_{j,k}/n$, with $\hat n_{j,k} := \#\{x_i : x_i \in C_{j,k}\}$
    oracle: $c_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} x\,d\rho$ | empirical counterpart: $\hat c_{j,k} := \frac{1}{\hat n_{j,k}}\sum_{x_i\in C_{j,k}} x_i$
    oracle: $\Sigma_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} (x - c_{j,k})(x - c_{j,k})^T d\rho$ | empirical counterpart: $\hat\Sigma_{j,k} := \frac{1}{\hat n_{j,k}}\sum_{x_i\in C_{j,k}} (x_i - \hat c_{j,k})(x_i - \hat c_{j,k})^T$
    oracle: $V_{j,k} := \arg\min_{\dim V = d} \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} \|x - c_{j,k} - \mathrm{Proj}_V(x - c_{j,k})\|^2 d\rho$ | empirical counterpart: $\hat V_{j,k} := \arg\min_{\dim V = d} \frac{1}{\hat n_{j,k}}\sum_{x_i\in C_{j,k}} \|x_i - \hat c_{j,k} - \mathrm{Proj}_V(x_i - \hat c_{j,k})\|^2$
    oracle: $P_{j,k}(x) := c_{j,k} + \mathrm{Proj}_{V_{j,k}}(x - c_{j,k})$ | empirical counterpart: $\hat P_{j,k}(x) := \hat c_{j,k} + \mathrm{Proj}_{\hat V_{j,k}}(x - \hat c_{j,k})$

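    The empirical quantities in Table 1 are simple per-cell computations. The following sketch (a rough illustration, not the paper's implementation) computes them for a single cell $C_{j,k}$ from the sample points that fall in it; the cell membership itself would be supplied by the multiscale tree, and all function and variable names are illustrative.

```python
import numpy as np

def empirical_gmra_cell(X_cell, d):
    """Empirical center, covariance, principal subspace and affine projector
    for one cell C_{j,k}, following Table 1. X_cell: (n_jk, D) points in the cell."""
    n_jk = X_cell.shape[0]
    c_hat = X_cell.mean(axis=0)                 # \hat c_{j,k}
    Z = X_cell - c_hat
    Sigma_hat = Z.T @ Z / n_jk                  # \hat Sigma_{j,k}
    eigval, eigvec = np.linalg.eigh(Sigma_hat)  # eigenvalues in ascending order
    V_hat = eigvec[:, -d:]                      # top-d eigenspace \hat V_{j,k}
    lam_hat = eigval[-d:][::-1]                 # top-d eigenvalues, descending

    def P_hat(x):
        # \hat P_{j,k}(x) = \hat c_{j,k} + Proj_{\hat V_{j,k}}(x - \hat c_{j,k})
        return c_hat + V_hat @ (V_hat.T @ (x - c_hat))

    return c_hat, Sigma_hat, V_hat, lam_hat, P_hat
```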

    (i) there exists $\theta_3 > 0$ such that, for every $j \ge j_{\min}$ and $k \in \mathcal{K}_j$, $\lambda_d^{j,k} \ge \theta_3\,\frac{2^{-2j}}{d}$;

    (ii) there exists $\theta_4 \in (0,1)$ such that $\lambda_{d+1}^{j,k} \le \theta_4\,\lambda_d^{j,k}$.

    These are natural properties for multiscale partitions generalizing dyadic partitions to non-Euclidean domains [10]. (A1) establishes that the cells constitute a tree structure. (A2) says that the cells at scale $j$ form a partition. (A3) guarantees that there are at most $2^{jd}/\theta_1$ cells at scale $j$. (A4) ensures that the diameter of all cells at scale $j$ is bounded by $2^{-j}$, up to a uniform constant. (A5)(i) assumes that the best rank-$d$ approximation to the covariance of a cell is close to the covariance matrix of a $d$-dimensional Euclidean ball, while (A5)(ii) assumes that the cell has significantly larger variance in $d$ directions than in all the remaining ones.

    Since all cells at scale $j$ have similar diameter, $\Lambda_j$ is called a uniform partition. A master tree $\mathcal{T}$ is a tree satisfying the properties above. A proper subtree $\tilde{\mathcal{T}}$ of $\mathcal{T}$ is a collection of nodes of $\mathcal{T}$ with the following properties: the root node is in $\tilde{\mathcal{T}}$; if a node is in $\tilde{\mathcal{T}}$, then its parent is also in $\tilde{\mathcal{T}}$. Any finite proper subtree $\tilde{\mathcal{T}}$ is associated with a unique partition $\Lambda = \Lambda(\tilde{\mathcal{T}})$ consisting of its outer leaves, by which we mean those nodes that are not in $\tilde{\mathcal{T}}$ but whose parent is.

    In practice, the master tree $\mathcal{T}$ is not given. We construct one by an application of the cover tree algorithm [3] (see [34, Algorithm 3]). In order to make the samples used for tree construction and function estimation independent of each other, we split the data in half and use one subset to construct the tree and the other for local PCA and regression. From now on we index the training data as $\{(x_i,y_i)\}_{i=1}^{2n}$, and split them as $\{(x_i,y_i)\}_{i=1}^{2n} = \{(x_i,y_i)\}_{i=1}^{n} \cup \{(x_i,y_i)\}_{i=n+1}^{2n}$. Running [34, Algorithm 3] on $\{x_i\}_{i=n+1}^{2n}$, we construct a family of cells $\{\hat C_{j,k}\}_{k\in\mathcal{K}_j,\,j_{\min}\le j\le j_{\max}}$ which satisfies (A1)–(A4) with high probability if $\rho$ is doubling*; furthermore, if $\mathcal{M}$ is a $C^s$, $s \in (1,\infty)$, $d$-dimensional closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and $\rho$ is the volume measure on $\mathcal{M}$, then (A5) is satisfied as well:

    *$\rho$ is doubling if there exists $C_1 > 1$ such that $C_1^{-1}r^d \le \rho(\mathcal{M}\cap B_r(x)) \le C_1 r^d$ for any $x \in \mathcal{M}$ and $r > 0$; $C_1$ is called the doubling constant of $\rho$. See also [10,14].

    Proposition 1 (Proposition 14 in [34]). Assume $\rho$ is a doubling probability measure on $\mathcal{M}$ with doubling constant $C_1$. Then the $\hat C_{j,k}$'s constructed by [34, Algorithm 3] satisfy:

    (a1) (A1) with $a_{\max} = C_1^2\,24^d$ and $a_{\min} = 1$;

    (a2) let $\hat{\mathcal{M}} = \bigcup_{j=j_{\min}}^{j_{\max}}\bigcup_{k\in\mathcal{K}_j}\hat C_{j,k}$; for any $\nu > 0$,

    $$\mathbb{P}\left\{\rho(\mathcal{M}\setminus\hat{\mathcal{M}}) > \frac{28\nu\log n}{3n}\right\} \le 2n^{-\nu};$$

    (a3) (A3) with $\theta_1 = C_1^{-1}4^{-d}$;

    (a4) (A4) with $\theta_2 = 3$.

    If additionally $\mathcal{M}$ is a $C^s$, $s \in (1,\infty)$, $d$-dimensional closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and $\rho$ is the volume measure on $\mathcal{M}$, then

    (a5) (A5) is satisfied when $j$ is sufficiently large.

    Since there are finitely many training points, the constructed master tree has a finite number of nodes. We first build a tree whose leaves contain a single point, and then prune it to the largest subtree whose leaves contain at least $d$ training points. This pruned tree associated with the $\hat C_{j,k}$'s is called the data master tree, and denoted by $\mathcal{T}^n$. The $\hat C_{j,k}$'s cover $\hat{\mathcal{M}}$, which represents the part of $\mathcal{M}$ that has been explored by the data.

    Even though assumption (A2) is not exactly satisfied, we claim that (a2) is sufficient for our performance guarantees, for example in the case where $\|f\|_\infty \le M$. Indeed, simply estimating $f$ by $0$ on $\mathcal{M}\setminus\hat{\mathcal{M}}$, for any $\nu > 0$ we have

    $$\mathbb{P}\left\{\int_{\mathcal{M}\setminus\hat{\mathcal{M}}} f^2\,d\rho \ge \frac{28M^2\nu\log n}{3n}\right\} \le 2n^{-\nu} \qquad\text{and}\qquad \mathbb{E}\int_{\mathcal{M}\setminus\hat{\mathcal{M}}} f^2\,d\rho \lesssim \frac{M^2(1+\nu)\log n}{n}.$$

    In view of these bounds, the rate of convergence on $\mathcal{M}\setminus\hat{\mathcal{M}}$ is faster than the ones we will obtain on $\hat{\mathcal{M}}$. We will therefore assume (A2), thanks to (a2). Also, it may happen that conditions (A3)–(A5) are satisfied at the coarsest scales with very poor constants $\theta$. Nonetheless, it will be clear that in all that follows we may discard a few coarse scales, and only work at scales that are fine enough and for which (A3)–(A5) truly capture in a quantitative way the local geometry of $\mathcal{M}$. Since regression is performed on an independent subset of the data, we can assume, by conditioning, that the $\hat C_{j,k}$'s are given and satisfy the required assumptions. To keep the notation simple, from now on we will write $C_{j,k}$ instead of $\hat C_{j,k}$, and $\mathcal{M}$ in place of $\hat{\mathcal{M}}$, with a slight abuse of notation.

    Besides cover trees, there are other methods that can be applied in practice to obtain multiscale partitions, such as METIS [27], used in [1], iterated PCA [42], and iterated k-means. These methods can be computationally more efficient than cover trees, but lead to partitions where the properties (A1)–(A5) are not guaranteed to hold; a small sketch of one such alternative is given below.
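    For concreteness, here is a minimal sketch of one of the alternative partitioners just mentioned, iterated PCA: each cell is split by the sign of the projection onto its top principal direction. This is only an illustration with hypothetical function names; as noted above, such constructions do not come with the guarantees (A1)–(A5).

```python
import numpy as np

def iterated_pca_tree(idx, X, min_points, depth=0, max_depth=20):
    """Recursively split a cell by the sign of the projection onto its top
    principal direction (the 'iterated PCA' construction mentioned above).
    Returns a nested dict of cells, each storing the sample indices it contains."""
    node = {"idx": idx, "depth": depth, "children": []}
    if len(idx) < 2 * min_points or depth >= max_depth:
        return node
    Xc = X[idx] - X[idx].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # Vt[0]: top principal direction
    side = Xc @ Vt[0] >= 0
    left, right = idx[side], idx[~side]
    if len(left) < min_points or len(right) < min_points:
        return node
    node["children"] = [
        iterated_pca_tree(left, X, min_points, depth + 1, max_depth),
        iterated_pca_tree(right, X, min_points, depth + 1, max_depth),
    ]
    return node
```

    A tree can be built with `iterated_pca_tree(np.arange(len(X)), X, min_points=d)`; the nested cells then play the role of the $C_{j,k}$'s in what follows.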

    After constructing the multiscale tree $\mathcal{T}$, GMRA computes a collection of affine projectors $\{P_j : \mathbb{R}^D \to \mathbb{R}^D\}_{j\ge j_{\min}}$. The main objects of GMRA, in their population and sample versions, are summarized in Table 1. Given a suitable partition $\Lambda \subseteq \mathcal{T}$, $\mathcal{M}$ can be approximated by the piecewise linear set $\{P_{j,k}(C_{j,k})\}_{C_{j,k}\in\Lambda}$.

    Given a multiscale tree decomposition $\{C_{j,k}\}_{j,k}$ and training samples $\{(x_i,y_i)\}_{i=1}^n$, we construct a family $\{\hat f_{j,k}\}_{j,k}$ of local estimates of $f$ in two stages: first we compute local coordinates on $C_{j,k}$ using the GMRA outlined above, and then we estimate $f|_{C_{j,k}}$ by fitting a polynomial of order $\ell$ on such coordinates. A global estimator $\hat f_\Lambda$ is finally obtained by summing the local estimates over a suitable partition $\Lambda$. Our regression method is specified in Algorithm 1. The explicit constructions of the constant ($\ell = 0$) and linear ($\ell = 1$) local estimators are detailed in Table 2.

    Table 2.  Constant and linear local estimators. The truncation in $[\Lambda_{j,k}]_d$ regularizes the least squares problem, which is ill-posed due to the small eigenvalues $\{\lambda_l^{j,k}\}_{l=d+1}^D$.
    piecewise constant ($\ell = 0$)
    oracle: $g_{j,k}^0(x) := \bar y_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} y\,d\rho$ | empirical counterpart: $\hat g_{j,k}^0(x) := \hat y_{j,k} := \frac{1}{\hat n_{j,k}}\sum_{x_i\in C_{j,k}} y_i$
    oracle: $f_{j,k}^0(x) := T_M[g_{j,k}^0(x)]$ | empirical counterpart: $\hat f_{j,k}^0(x) := T_M[\hat g_{j,k}^0(x)]$
    piecewise linear ($\ell = 1$)
    oracle: $g_{j,k}^1(x) := [\pi_{j,k}(x)^T\ \ 2^{-j}]\,\beta_{j,k}$ | empirical counterpart: $\hat g_{j,k}^1(x) := [\hat\pi_{j,k}(x)^T\ \ 2^{-j}]\,\hat\beta_{j,k}$
    oracle: $\beta_{j,k} := \begin{bmatrix}[\Lambda_{j,k}]_d^{-1} & 0\\ 0 & 2^{2j}\end{bmatrix}\frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} y\begin{bmatrix}\pi_{j,k}(x)\\ 2^{-j}\end{bmatrix}d\rho$ | empirical counterpart: $\hat\beta_{j,k} := \begin{bmatrix}[\hat\Lambda_{j,k}]_d^{-1} & 0\\ 0 & 2^{2j}\end{bmatrix}\frac{1}{\hat n_{j,k}}\sum_{x_i\in C_{j,k}} y_i\begin{bmatrix}\hat\pi_{j,k}(x_i)\\ 2^{-j}\end{bmatrix}$
    oracle: $[\Lambda_{j,k}]_d := \mathrm{diag}(\lambda_1^{j,k},\dots,\lambda_d^{j,k})$ | empirical counterpart: $[\hat\Lambda_{j,k}]_d := \mathrm{diag}(\hat\lambda_1^{j,k},\dots,\hat\lambda_d^{j,k})$
    oracle: $f_{j,k}^1(x) := T_M[g_{j,k}^1(x)]$ | empirical counterpart: $\hat f_{j,k}^1(x) := T_M[\hat g_{j,k}^1(x)]$

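    To make the formulas in Table 2 concrete, here is a sketch of the empirical piecewise constant and piecewise linear local estimators for a single cell, assuming the cell's empirical quantities are computed as in Table 1; function and variable names are illustrative, not from the paper's code, and the cell is assumed to contain at least $d$ points in general position.

```python
import numpy as np

def local_estimators(X_cell, y_cell, d, j, M):
    """Empirical local estimators on one cell C_{j,k}, following Table 2.
    Returns (f0, f1): the piecewise constant and piecewise linear fits."""
    n_jk = len(y_cell)
    c_hat = X_cell.mean(axis=0)
    Z = X_cell - c_hat
    eigval, eigvec = np.linalg.eigh(Z.T @ Z / n_jk)
    V_hat, lam_hat = eigvec[:, -d:], eigval[-d:]          # top-d eigenpairs of \hat Sigma_{j,k}
    pi = lambda x: (x - c_hat) @ V_hat                    # \hat pi_{j,k}(x) in R^d
    scale = 2.0 ** (-j)

    # piecewise constant (ell = 0): truncated cell average
    y_bar = y_cell.mean()
    f0 = lambda x: np.clip(y_bar, -M, M)

    # piecewise linear (ell = 1): regularized least squares in GMRA coordinates
    A = np.column_stack([pi(X_cell), np.full(n_jk, scale)])  # rows [pi(x_i), 2^{-j}]
    Q = np.diag(np.r_[1.0 / lam_hat, scale ** -2])           # blockdiag([Lambda]_d^{-1}, 2^{2j})
    beta_hat = Q @ (A.T @ y_cell) / n_jk                     # \hat beta_{j,k}
    f1 = lambda x: np.clip(np.r_[pi(x), scale] @ beta_hat, -M, M)

    return f0, f1
```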

    In order to analyze the performance of our method, we introduce the oracle estimator $f_\Lambda$ based on the distribution $\rho$, defined by

    $$\pi_{j,k} : C_{j,k} \to \mathbb{R}^d, \quad \pi_{j,k}(x) := V_{j,k}^T(x - c_{j,k}), \qquad p_{j,k} := \arg\min_{p\in\mathbb{P}^\ell}\int_{C_{j,k}}|y - p\circ\pi_{j,k}(x)|^2\,d\rho,$$
    $$f_{j,k} := T_M[p_{j,k}\circ\pi_{j,k}], \qquad f_\Lambda := \sum_{C_{j,k}\in\Lambda} f_{j,k}\,\mathbf{1}_{j,k},$$

    and split the MSE into a bias and a variance term:

    $$\mathbb{E}\|f - \hat f_\Lambda\|^2 \le 2\underbrace{\|f - f_\Lambda\|^2}_{\text{bias}^2} + 2\underbrace{\mathbb{E}\|f_\Lambda - \hat f_\Lambda\|^2}_{\text{variance}}. \tag{1}$$

    The bias term is a deterministic approximation error, and will be handled by assuming suitable regularity models for $\rho$ and $f$ (see Definitions 1 and 3). The variance term quantifies the stochastic error arising from finite-sample estimation, and will be bounded using concentration inequalities (see Proposition 2). The role of $\Lambda$, encoded in its size $\#\Lambda$, is crucial to balance (1). We will discuss two possible choices: uniform partitions in Section 3.1, and adaptive partitions at multiple scales in Section 3.2.

    Algorithm 1 GMRA regression
    Input: training data $\{(x_i,y_i)\}_{i=1}^{2n}$, intrinsic dimension $d$, bound $M$, polynomial order $\ell$, approximation type (uniform or adaptive).
    Output: multiscale tree decomposition $\mathcal{T}^n$, partition $\Lambda$, piecewise $\ell$-order polynomial estimator $\hat f_\Lambda$.
    1: construct a multiscale tree $\mathcal{T}^n$ by [34, Algorithm 3] on $\{x_i\}_{i=n+1}^{2n}$;
    2: compute centers $\hat c_{j,k}$ and subspaces $\hat V_{j,k}$ by empirical GMRA on $\{x_i\}_{i=1}^{n}$;
    3: define coordinates $\hat\pi_{j,k}$ on $C_{j,k}$:
    $$\hat\pi_{j,k} : C_{j,k} \to \mathbb{R}^d, \quad \hat\pi_{j,k}(x) := \hat V_{j,k}^T(x - \hat c_{j,k});$$
    4: compute local estimators $\hat g_{j,k}$ by solving the following least squares problems over the space $\mathbb{P}^\ell$ of polynomials of degree $\ell$:
    $$\hat p_{j,k} := \arg\min_{p\in\mathbb{P}^\ell}\frac{1}{\hat n_{j,k}}\sum_{i=1}^{n}|y_i - p\circ\hat\pi_{j,k}(x_i)|^2\,\mathbf{1}_{j,k}(x_i), \qquad \hat g_{j,k} := \hat p_{j,k}\circ\hat\pi_{j,k}; \tag{2}$$
    5: truncate $\hat g_{j,k}$ at level $M$:
    $$\hat f_{j,k} := T_M[\hat g_{j,k}];$$
    6: construct a uniform (see Section 3.1) or adaptive (see Section 3.2) partition $\Lambda$;
    7: define the global estimator $\hat f_\Lambda$ by summing the local estimators over the partition $\Lambda$:
    $$\hat f_\Lambda := \sum_{C_{j,k}\in\Lambda}\hat f_{j,k}\,\mathbf{1}_{j,k}.$$


    A first natural choice for $\Lambda$ is a uniform partition $\Lambda_j := \{C_{j,k}\}_{k\in\mathcal{K}_j}$, $j \ge j_{\min}$. At scale $j$, $f$ is estimated by $\hat f_{\Lambda_j} = \sum_{k\in\mathcal{K}_j}\hat f_{j,k}\mathbf{1}_{j,k}$. The bias $\|f - f_{\Lambda_j}\|$ decays at a rate depending on the regularity of $f$, which can be quantified as follows:

    Definition 1 (model class $\mathcal{A}_s$). A function $f : \mathcal{M} \to \mathbb{R}$ is in the class $\mathcal{A}_s$ for some $s > 0$ with respect to the measure $\rho$ if

    $$|f|_{\mathcal{A}_s} := \sup_{\mathcal{T}}\ \sup_{j\ge j_{\min}}\ 2^{js}\|f - f_{\Lambda_j}\| < \infty,$$

    where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).

    We capture the case where the bias is roughly the same on every cell with the following definition:

    Definition 2 (model class $\mathcal{A}_{\infty,s}$). A function $f : \mathcal{M} \to \mathbb{R}$ is in the class $\mathcal{A}_{\infty,s}$ for some $s > 0$ with respect to the measure $\rho$ if

    $$|f|_{\mathcal{A}_{\infty,s}} := \sup_{\mathcal{T}}\ \sup_{j\ge j_{\min}}\ \sup_{k\in\mathcal{K}_j}\ \frac{\|(f - f_{j,k})\mathbf{1}_{j,k}\|}{2^{-js}\sqrt{\rho(C_{j,k})}} < \infty,$$

    where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).

    Clearly $\mathcal{A}_{\infty,s} \subseteq \mathcal{A}_s$. These classes contain uniformly regular functions on manifolds, such as Hölder functions.

    Example 1. Let $\mathcal{M}$ be a closed smooth $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and let $\rho$ be the volume measure on $\mathcal{M}$. Consider a function $f : \mathcal{M} \to \mathbb{R}$ and a smooth chart $(U,\phi)$ on $\mathcal{M}$. The function $\tilde f : \phi(U) \to \mathbb{R}$ defined by $\tilde f(v) = f\circ\phi^{-1}(v)$ is called the coordinate representation of $f$. Let $\lambda = (\lambda_1,\dots,\lambda_d)$ be a multi-index with $|\lambda| := \lambda_1 + \dots + \lambda_d = \ell$. The $\ell$-order $\lambda$-derivative of $f$ is defined as

    $$\partial^\lambda f(x) := \partial^\lambda(f\circ\phi^{-1}).$$

    Hölder functions $C^{\ell,\alpha}$ on $\mathcal{M}$ with $\ell \in \mathbb{N}$ and $\alpha \in (0,1]$ are defined as follows: $f \in C^{\ell,\alpha}$ if the $\ell$-order derivatives of $f$ exist, and

    $$|f|_{C^{\ell,\alpha}} := \max_{|\lambda|=\ell}\ \sup_{x\ne z}\ \frac{|\partial^\lambda f(x) - \partial^\lambda f(z)|}{d_{\mathcal{M}}(x,z)^\alpha} < \infty,$$

    $d_{\mathcal{M}}(x,z)$ being the geodesic distance between $x$ and $z$. We will always assume to work at sufficiently fine scales at which $d_{\mathcal{M}}(x,z) \asymp \|z - x\|_{\mathbb{R}^D}$. Note that $C^{\ell,1}$ is the space of $\ell$-times continuously differentiable functions on $\mathcal{M}$ with Lipschitz $\ell$-order derivatives. We have $C^{\ell,\alpha} \subseteq \mathcal{A}_{\infty,\ell+\alpha}$ with $|f|_{\mathcal{A}_{\infty,\ell+\alpha}} \le \theta_2^{\ell+\alpha} d^\ell\,|f|_{C^{\ell,\alpha}}/\ell!$. The proof is in Appendix A.

    Example 2. Let $\mathcal{M}$ be a smooth closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and let $\rho$ be the volume measure on $\mathcal{M}$. Let $\Omega \subset \mathcal{M}$ be such that $\Gamma := \partial\Omega$ is a smooth and closed $d_\Gamma$-dimensional submanifold with finite reach. Let $g = a\mathbf{1}_\Omega + b\mathbf{1}_{\mathcal{M}\setminus\Omega}$ for some $a,b \in \mathbb{R}$, where $\mathbf{1}_S$ denotes the indicator function of a set $S$. Then $g \in \mathcal{A}_{(d-d_\Gamma)/2}$ for every $\ell = 0,1,2,\dots$; however, $g \notin \mathcal{A}_{\infty,s}$ for any $s > 0$. The proof is in Appendix A.

    The reach of $\mathcal{M}$ is an important global characteristic of $\mathcal{M}$. Let $D(\mathcal{M}) := \{y\in\mathbb{R}^D : \exists!\ x\in\mathcal{M}\ \text{s.t.}\ \|x-y\| = \inf_{z\in\mathcal{M}}\|z-y\|\}$ and $\mathcal{M}_r := \{y\in\mathbb{R}^D : \inf_{x\in\mathcal{M}}\|x-y\| < r\}$. Then $\mathrm{reach}(\mathcal{M}) := \sup\{r \ge 0 : \mathcal{M}_r \subseteq D(\mathcal{M})\}$. See also [19].

    When we take uniform partitions $\Lambda = \Lambda_j$ in (1), the squared bias satisfies

    $$\|f - f_{\Lambda_j}\|^2 \le |f|_{\mathcal{A}_s}^2\,2^{-2js}$$

    whenever $f \in \mathcal{A}_s$, which decreases as $j$ increases. On the other hand, Proposition 2 shows that the variance at scale $j$ satisfies

    $$\mathbb{E}\|f_{\Lambda_j} - \hat f_{\Lambda_j}\|^2 \le O\!\left(\frac{j\,2^{jd}}{n}\right),$$

    which increases as $j$ increases. Choosing the optimal scale $j^*$ in the bias-variance tradeoff, we obtain the following rate of convergence for uniform estimators:

    Theorem 1. Suppose $\|f\|_\infty \le M$ and $f \in \mathcal{A}_s$ for $\ell \in \{0,1\}$ and $s > 0$. Let $j^*$ be chosen such that

    $$2^{-j^*} := \mu\left(\frac{\log n}{n}\right)^{\frac{1}{2s+d}}$$

    for $\mu > 0$. Then there exist positive constants $c := c(\theta_1,d,\mu)$ and $C := C(\theta_1,d,\mu)$ for $\ell = 0$, or $c := c(\theta_1,\theta_2,\theta_3,d,\mu)$ and $C := C(\theta_1,\theta_2,\theta_3,d,\mu)$ for $\ell = 1$, such that:

    (a) for every $\nu > 0$ there is $c_\nu > 0$ such that

    $$\mathbb{P}\left\{\|f - \hat f_{\Lambda_{j^*}}\| > \big(|f|_{\mathcal{A}_s}\mu^s + c_\nu\big)\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} \le Cn^{-\nu},$$

    where $c_\nu := c_\nu(\nu,\theta_1,d,M,\sigma,s,\mu)$ for $\ell = 0$, and $c_\nu := c_\nu(\nu,\theta_1,\theta_2,\theta_3,d,M,\sigma,s,\mu)$ for $\ell = 1$;

    (b) $\mathbb{E}\|f - \hat f_{\Lambda_{j^*}}\|^2 \le \big(|f|^2_{\mathcal{A}_s}\mu^{s} + c\max(M^2,\sigma^2)\big)\left(\dfrac{\log n}{n}\right)^{\frac{2s}{2s+d}}$.

    Theorem 1 is proved in Section 4. Note that the rate depends on the intrinsic dimension $d$ instead of the ambient dimension $D$. Moreover, the rate is optimal (up to logarithmic factors) at least in the case of $C^{\ell,\alpha}$ functions on $\mathcal{M}$, as discussed in Example 1.
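    As a small illustration of the scale selection in Theorem 1 (with hypothetical values of $n$, $s$, $d$, $\mu$), the optimal scale $j^*$ and the corresponding error rate can be computed as follows.

```python
import numpy as np

# Optimal uniform scale from Theorem 1: 2^{-j*} = mu * (log n / n)^{1/(2s+d)}.
# The values of n, s, d, mu are hypothetical, for illustration only.
n, s, d, mu = 10_000, 2.0, 3, 1.0

j_star = np.log2(1.0 / mu) + np.log(n / np.log(n)) / ((2 * s + d) * np.log(2))
rate = (np.log(n) / n) ** (s / (2 * s + d))   # order of the error in Theorem 1(a)

print(f"optimal scale j* ~ {j_star:.2f}, error rate ~ {rate:.3e}")
```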

    Theorem 1 is not fully satisfactory for two reasons: (i) the choice of the optimal scale requires knowledge of the regularity of the unknown function; (ii) no uniform scale can be optimal if the regularity of the function varies at different locations and scales. We thus propose an adaptive estimator which learns near-optimal partitions from the data, without knowing the possibly nonuniform regularity of the function. Adaptive partitions are selected by a criterion that determines whether or not a cell should be picked. The quantities involved in this selection are summarized in Table 3, along with their empirical versions.

    Table 3.  Local approximation difference between scales.
    oracle: $W_{j,k} := (f_{\Lambda_j} - f_{\Lambda_{j+1}})\mathbf{1}_{j,k}$ | empirical counterpart: $\hat W_{j,k} := (\hat f_{\Lambda_j} - \hat f_{\Lambda_{j+1}})\mathbf{1}_{j,k}$
    oracle: $\Delta_{j,k} := \|W_{j,k}\|$ | empirical counterpart: $\hat\Delta_{j,k} := \|\hat W_{j,k}\|_n$


    $\Delta_{j,k}$ plays the role of the magnitude of a wavelet coefficient in typical wavelet thresholding constructions, and reduces to it in the case of Haar wavelets on Euclidean domains with dyadic partitions. It measures the local difference in approximation between two consecutive scales: a large $\Delta_{j,k}$ suggests a significant reduction of error if we refine $C_{j,k}$ to its children. Intuitively, we should truncate the master tree to the subtree including the nodes where this quantity is large. However, if too few samples fall in a node, then the empirical counterpart $\hat\Delta_{j,k}$ cannot be trusted. We thus proceed as follows. We set a threshold $\tau_n$ decreasing in $n$, and let $\hat{\mathcal{T}}^n(\tau_n)$ be the smallest proper subtree of $\mathcal{T}^n$ containing all $C_{j,k}$'s for which $\hat\Delta_{j,k} \ge \tau_n$. Crucially, $\tau_n$ may be chosen independently of the regularity of $f$ (see Theorem 2). We finally define our adaptive partition $\hat\Lambda^n(\tau_n)$ as the partition associated with the outer leaves of $\hat{\mathcal{T}}^n(\tau_n)$. The procedure is summarized in Algorithm 2.

    Algorithm 2 Adaptive partition
    Input: training data $\{(x_i,y_i)\}_{i=1}^n$, multiscale tree decomposition $\mathcal{T}^n$, local $\ell$-order polynomial estimates $\{\hat f_{j,k}\}_{j,k}$, threshold parameter $\kappa$.
    Output: adaptive partition $\hat\Lambda^n(\tau_n)$.
    1: compute the approximation difference $\hat\Delta_{j,k}$ on every node $C_{j,k} \in \mathcal{T}^n$;
    2: set the threshold $\tau_n := \kappa\sqrt{(\log n)/n}$;
    3: select the smallest proper subtree $\hat{\mathcal{T}}^n(\tau_n)$ of $\mathcal{T}^n$ containing all $C_{j,k}$'s with $\hat\Delta_{j,k} \ge \tau_n$;
    4: define the adaptive partition $\hat\Lambda^n(\tau_n)$ associated with the outer leaves of $\hat{\mathcal{T}}^n(\tau_n)$.

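    Below is a minimal sketch of Algorithm 2, assuming the data master tree is represented as nested nodes that already store the empirical coefficients $\hat\Delta_{j,k}$ (computed from the local estimates of Algorithm 1); the dictionary-based tree representation and the handling of leaves are illustrative simplifications, not the paper's data structure.

```python
import numpy as np

def adaptive_partition(root, n, kappa):
    """Sketch of Algorithm 2. Each node is a dict with 'children' (list of nodes)
    and 'delta_hat' (a precomputed empirical coefficient \\hat Delta_{j,k}).
    Returns the adaptive partition \\hat Lambda^n(tau_n) as a list of nodes."""
    tau_n = kappa * np.sqrt(np.log(n) / n)      # threshold of Algorithm 2, step 2

    def keep(node):
        # node lies in the smallest proper subtree \hat T^n(tau_n) iff it, or some
        # descendant, has \hat Delta_{j,k} >= tau_n
        return node["delta_hat"] >= tau_n or any(keep(c) for c in node["children"])

    partition = []

    def collect(node):
        if not node["children"]:
            partition.append(node)              # leaf of the data master tree: cannot refine
            return
        for child in node["children"]:
            if keep(child):
                collect(child)                  # child kept in the subtree: keep refining
            else:
                partition.append(child)         # outer leaf of \hat T^n(tau_n)

    collect(root)                               # the root always belongs to the proper subtree
    return partition
```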

    To provide performance guarantees for our adaptive estimator, we need to define a proper model class based on oracles. Given any master tree $\mathcal{T}$ satisfying assumptions (A1)–(A5) and a threshold $\tau > 0$, we let $\mathcal{T}(\tau)$ be the smallest proper subtree of $\mathcal{T}$ containing all the cells $C_{j,k}$ with $\Delta_{j,k} \ge \tau$. The partition made of the outer leaves of $\mathcal{T}(\tau)$ is denoted by $\Lambda(\tau)$.

    Definition 3 (model class $\mathcal{B}_s$). A function $f : \mathcal{M} \to \mathbb{R}$ is in the class $\mathcal{B}_s$ for some $s > 0$ with respect to the measure $\rho$ if

    $$|f|^p_{\mathcal{B}_s} := \sup_{\mathcal{T}}\,\sup_{\tau > 0}\ \tau^p\,\#\mathcal{T}(\tau) < \infty, \qquad p = \frac{2d}{2s+d},$$

    where $\mathcal{T}$ varies over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).

    In general, the truncated tree $\mathcal{T}(\tau)$ grows as the threshold $\tau$ decreases. For elements of $\mathcal{B}_s$, we have control on the growth rate, namely $\#\mathcal{T}(\tau) \le |f|^p_{\mathcal{B}_s}\,\tau^{-p}$. In the classical case of dyadic partitions of the Euclidean space with uniform measure, $\mathcal{B}_s$ is well understood as a nonlinear approximation space containing a scale of Besov spaces [11]. The class $\mathcal{B}_s$ is indeed rich, and contains in particular $\mathcal{A}_{\infty,s}$, while additionally capturing functions of nonuniform regularity.

    Lemma 1. $\mathcal{A}_{\infty,s} \subseteq \mathcal{B}_s$. If $f \in \mathcal{A}_{\infty,s}$, then $f \in \mathcal{B}_s$ and $|f|_{\mathcal{B}_s} \le (a_{\min}/\theta_1)^{\frac{2s+d}{2d}}|f|_{\mathcal{A}_{\infty,s}}$.

    The proof is given in Appendix A.

    Example 3. Let $g$ be the function in Example 2. Then $g \in \mathcal{B}_{d(d-d_\Gamma)/(2d_\Gamma)}$ for every $\ell = 0,1,2,\dots$. Notice that $g \in \mathcal{A}_{(d-d_\Gamma)/2}$, so $g$ has a larger regularity parameter $s$ in the $\mathcal{B}_s$ model than in the $\mathcal{A}_s$ model.

    We will also need a quasi-orthogonality condition ensuring that the functions Wj,k representing the approximation difference between two scales are almost orthogonal across scales.

    Definition 4. We say that $f$ satisfies quasi-orthogonality of order $\ell$ with respect to the measure $\rho$ if there exists a constant $B_0 > 0$ such that, for any proper subtree $\mathcal{S}$ of any tree $\mathcal{T}$ satisfying assumptions (A1)–(A5),

    $$\Big\|\sum_{C_{j,k}\in\mathcal{T}\setminus\mathcal{S}} W_{j,k}\Big\|^2 \le B_0\sum_{C_{j,k}\in\mathcal{T}\setminus\mathcal{S}}\|W_{j,k}\|^2.$$

    The following lemma shows that $f \in \mathcal{B}_s$, along with quasi-orthogonality, implies a certain approximation rate of $f$ by $f_{\Lambda(\tau)}$ as $\tau \to 0$. The proof is given in Appendix A.

    Lemma 2. If $f \in \mathcal{B}_s \cap (L^\infty \cup \mathcal{A}_t)$ for some $s,t > 0$, and $f$ satisfies quasi-orthogonality of order $\ell$, then

    $$\|f - f_{\Lambda(\tau)}\|^2 \le B_{s,d}\,|f|^p_{\mathcal{B}_s}\tau^{2-p} \le B_{s,d}\,|f|^2_{\mathcal{B}_s}\,\#\Lambda(\tau)^{-\frac{2s}{d}}, \qquad p = \frac{2d}{2s+d},$$

    with $B_{s,d} := \dfrac{B_0\,2^p}{1 - 2^{-(2-p)}}$.

    The main result of this paper is the following performance analysis of adaptive estimators, which is proved in Section 4.

    Theorem 2. Let $\ell \in \{0,1\}$ and $M > 0$. Suppose $\|f\|_\infty \le M$ and $f$ satisfies quasi-orthogonality of order $\ell$. Set $\tau_n := \kappa\sqrt{\log n/n}$. Then:

    (a) For every $\nu > 0$ there exists $\kappa_\nu := \kappa_\nu(a_{\max},\theta_2,\theta_3,d,M,\sigma,\nu) > 0$ such that, whenever $f \in \mathcal{B}_s$ for some $s > 0$ and $\kappa \ge \kappa_\nu$, there are $r, C > 0$ such that

    $$\mathbb{P}\left\{\|f - \hat f_{\hat\Lambda^n(\tau_n)}\| > r\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} \le Cn^{-\nu}.$$

    (b) There exists $\kappa_0 := \kappa_0(a_{\max},\theta_2,\theta_3,d,M,\sigma)$ such that, whenever $f \in \mathcal{B}_s$ for some $s > 0$ and $\kappa \ge \kappa_0$, there is $\bar C > 0$ such that

    $$\mathbb{E}\|f - \hat f_{\hat\Lambda^n(\tau_n)}\|^2 \le \bar C\left(\frac{\log n}{n}\right)^{\frac{2s}{2s+d}}.$$

    Here $r$ depends on $\theta_2$, $\theta_3$, $a_{\max}$, $d$, $M$, $s$, $|f|_{\mathcal{B}_s}$, $\sigma$, $B_0$, $\nu$, $\kappa$; $C$ depends on $\theta_2$, $\theta_3$, $a_{\max}$, $a_{\min}$, $d$, $s$, $|f|_{\mathcal{B}_s}$, $\kappa$; $\bar C$ depends on $\theta_2$, $\theta_3$, $a_{\max}$, $a_{\min}$, $d$, $M$, $s$, $|f|_{\mathcal{B}_s}$, $B_0$, $\kappa$.

    Theorem 2 is more satisfactory than Theorem 1 for two reasons: (i) the same rate is achieved for a richer model class; (ii) the estimator does not require a priori knowledge of the regularity of the function, since the choice of κ is independent of s.

    For a given accuracy $\varepsilon$, in order to achieve $\mathrm{MSE} \le \varepsilon^2$, the number of samples we need is $n_\varepsilon \asymp (1/\varepsilon)^{\frac{2s+d}{s}}\log(1/\varepsilon)$. When $s$ is unknown, we can determine it as follows: we fix a small $n_0$, and run Algorithm 2 with $2n_0, 4n_0, \dots, 2^j n_0$ samples. For each sample size, we evenly split the data into a training set to build the adaptive estimator, and a test set to evaluate the MSE. According to Theorem 2, the MSE scales as $(\log n/n)^{\frac{2s}{2s+d}}$. Therefore, the slope in the log-log plot of the MSE versus $n$ gives an approximation of $2s/(2s+d)$. This could be formalized by a suitable adaptation of Lepski's method.
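    As an illustration of this slope-based estimate, with hypothetical (n, MSE) pairs standing in for the output of Algorithm 2 on nested subsamples, the exponent $2s/(2s+d)$, and hence $s$ if $d$ is known, can be recovered by a least squares fit in log-log coordinates; here the regression is taken against $\log(\log n/n)$, consistent with the rate in Theorem 2.

```python
import numpy as np

# Estimate the exponent 2s/(2s+d) from the slope of log(MSE) vs log(log n / n).
# The (n, mse) pairs are hypothetical placeholders: in practice they would come
# from running Algorithm 2 on nested subsamples of the training data.
ns = np.array([2_000, 4_000, 8_000, 16_000, 32_000])
mses = np.array([1.2e-2, 7.1e-3, 4.3e-3, 2.6e-3, 1.6e-3])

x = np.log(np.log(ns) / ns)
slope, _ = np.polyfit(x, np.log(mses), 1)      # slope ~ 2s/(2s+d)

d = 3                                           # intrinsic dimension (assumed known here)
s_hat = d * slope / (2 * (1 - slope))           # invert slope = 2s/(2s+d)
print(f"estimated exponent ~ {slope:.2f}, implied s ~ {s_hat:.2f}")
```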

    The computational cost of Algorithms 1 and 2 may be split as follows:

    Tree construction. The cover tree is an online data structure: a single-point insertion or removal costs at most $O(\log n)$. The total computational cost of the cover tree algorithm is $C^d D\,n\log n$, where $C > 0$ is a constant [3].

    Local PCA. At every scale $j$, we perform local PCA on the training data restricted to $C_{j,k}$ for every $k \in \mathcal{K}_j$, using the randomized PCA algorithm [22]. Recall that $\hat n_{j,k}$ denotes the number of training points in $C_{j,k}$. The cost of local PCA at scale $j$ is of order $\sum_{k\in\mathcal{K}_j} Dd\,\hat n_{j,k} = Dd\,n$, and there are at most $c\log n$ scales, where $c > 0$ is a constant, which gives a total cost of $cDd\,n\log n$.

    Multiscale regression. Given $\hat n_{j,k}$ training points on $C_{j,k}$, computing the low-dimensional coordinates $\hat\pi_{j,k}(x_i)$ for all $x_i \in C_{j,k}$ costs $Dd\,\hat n_{j,k}$, and solving the linear least squares problem (2), whose matrix has size $\hat n_{j,k}\times d$, costs at most $\hat n_{j,k}d^2$. Hence, constructing the $\ell$-order polynomials at scale $j$ takes $\sum_{k\in\mathcal{K}_j}(Dd\,\hat n_{j,k} + d^2\hat n_{j,k}) = (Dd + d^2)n$, and there are at most $c\log n$ scales, which sums up to $c(Dd + d^2)n\log n$.

    Adaptive approximation. We need to compute the coefficients $\hat\Delta_{j,k}$ for every $C_{j,k}$, which costs $2(Dd + d)\hat n_{j,k}$ on $C_{j,k}$, and $2c(Dd + d)n\log n$ for the whole tree.

    In summary, the total cost of constructing GMRA adaptive estimators of order $\ell$ is

    $$C^d D\,n\log n + (4Dd + d^2 + d)\,c\,n\log n.$$

    The cost scales linearly with the number of samples n up to a logarithmic factor, and linearly with the ambient dimension D.

    We analyze the error of our estimator by a bias-variance decomposition as in (1). We present the variance estimate in Section 4.1, the proofs for uniform approximations in Section 4.2, and for adaptive approximations in Section 4.3.

    The following proposition bounds the variance of 0-order (piecewise constant) and 1st-order (piecewise linear) estimators over an arbitrary partition Λ.

    Proposition 2. Suppose $\|f\|_\infty \le M$ and let $\ell \in \{0,1\}$. For any partition $\Lambda$, let $f_\Lambda$ and $\hat f_\Lambda$ be the optimal approximation and the empirical estimator of order $\ell$ on $\Lambda$, respectively. Then, for every $\eta > 0$,

    (a) $\mathbb{P}\{\|f_\Lambda - \hat f_\Lambda\| > \eta\} \le \begin{cases} C_0\,\#\Lambda\,\exp\left(-\dfrac{n\eta^2}{c_0\max(M^2,\sigma^2)\,\#\Lambda}\right) & \text{for } \ell = 0,\\[2mm] C_1 d\,\#\Lambda\,\exp\left(-\dfrac{n\eta^2}{c_1\max(d^4M^2,\,d^2\sigma^2)\,\#\Lambda}\right) & \text{for } \ell = 1,\end{cases}$

    (b) $\mathbb{E}\|f_\Lambda - \hat f_\Lambda\|^2 \le \begin{cases} c_0\max(M^2,\sigma^2)\,\dfrac{\#\Lambda\,\log(C_0\,\#\Lambda)}{n} & \text{for } \ell = 0,\\[2mm] c_1\max(d^4M^2,\,d^2\sigma^2)\,\dfrac{\#\Lambda\,\log(C_1 d\,\#\Lambda)}{n} & \text{for } \ell = 1,\end{cases}$

    for some absolute constants $c_0, C_0$ and some $c_1, C_1$ depending on $\theta_2, \theta_3$.

    Proof. Since $f_\Lambda$ and $\hat f_\Lambda$ are bounded by $M$, we define $\Lambda^- := \{C_{j,k}\in\Lambda : \rho(C_{j,k}) \le \frac{\eta^2}{4M^2\#\Lambda}\}$, and observe that

    $$\Big\|\sum_{C_{j,k}\in\Lambda^-}(f_\Lambda - \hat f_\Lambda)\mathbf{1}_{j,k}\Big\|^2 \le \eta^2.$$

    We then restrict our attention to $\Lambda^+ := \Lambda\setminus\Lambda^-$ and apply Lemma 5 with $t = \frac{\eta}{\sqrt{\rho(C_{j,k})\,\#\Lambda}}$. This leads to (a), while (b) follows from (a) by integrating over $\eta > 0$.

    Notice that $\#\Lambda_j \le 2^{jd}/\theta_1$ by (A3). By choosing $j^*$ such that $2^{-j^*} = \mu\big(\frac{\log n}{n}\big)^{\frac{1}{2s+d}}$ for some $\mu > 0$, we have

    $$\|f - f_{\Lambda_{j^*}}\| \le |f|_{\mathcal{A}_s}2^{-j^*s} \le |f|_{\mathcal{A}_s}\mu^s\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}.$$

    Moreover, by Proposition 2,

    $$\mathbb{P}\left\{\|f_{\Lambda_{j^*}} - \hat f_{\Lambda_{j^*}}\| \ge c_\nu\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} \le \begin{cases} \dfrac{C_0}{\theta_1\mu^d}(\log n)^{-\frac{d}{2s+d}}\,n^{-\left(\frac{\theta_1\mu^d c_\nu^2}{c_0\max(M^2,\sigma^2)} - \frac{d}{2s+d}\right)} & (\ell = 0)\\[3mm] \dfrac{C_1 d}{\theta_1\mu^d}(\log n)^{-\frac{d}{2s+d}}\,n^{-\left(\frac{\theta_1\mu^d c_\nu^2}{c_1\max(d^4M^2,\,d^2\sigma^2)} - \frac{d}{2s+d}\right)} & (\ell = 1)\end{cases}\ \le\ Cn^{-\nu}$$

    provided that $\frac{\theta_1\mu^d c_\nu^2}{c_0\max(M^2,\sigma^2)} - \frac{d}{2s+d} > \nu$ for $\ell = 0$ and $\frac{\theta_1\mu^d c_\nu^2}{c_1\max(d^4M^2,\,d^2\sigma^2)} - \frac{d}{2s+d} > \nu$ for $\ell = 1$.

    We begin by defining several objects of interest:

    ● $\mathcal{T}^n$: the data master tree, whose leaves contain at least $d$ points of the training data. It can be viewed as the part of a multiscale tree that our training data have explored. Notice that

    $$\#\mathcal{T}^n \le \sum_{j\ge 0} a_{\min}^{-j}\,\frac{n}{d} = \frac{a_{\min}}{a_{\min}-1}\cdot\frac{n}{d} \le \frac{a_{\min}\,n}{d}.$$

    ● $\mathcal{T}$: a complete multiscale tree containing $\mathcal{T}^n$.

    $\mathcal{T}$ can be viewed as the union of $\mathcal{T}^n$ and some empty cells, mostly at fine scales with high probability, that our data have not explored.

    ● $\mathcal{T}(\tau)$: the smallest subtree of $\mathcal{T}$ which contains $\{C_{j,k}\in\mathcal{T} : \Delta_{j,k} \ge \tau\}$.

    ● $\mathcal{T}^n(\tau) := \mathcal{T}(\tau)\cap\mathcal{T}^n$.

    ● $\hat{\mathcal{T}}^n(\tau)$: the smallest subtree of $\mathcal{T}^n$ which contains $\{C_{j,k}\in\mathcal{T}^n : \hat\Delta_{j,k} \ge \tau\}$.

    ● $\Lambda(\tau)$: the adaptive partition associated with $\mathcal{T}(\tau)$.

    ● $\Lambda^n(\tau)$: the adaptive partition associated with $\mathcal{T}^n(\tau)$.

    ● $\hat\Lambda^n(\tau)$: the adaptive partition associated with $\hat{\mathcal{T}}^n(\tau)$.

    ● Suppose $\mathcal{T}_0$ and $\mathcal{T}_1$ are two subtrees of $\mathcal{T}$. If $\Lambda_0$ and $\Lambda_1$ are the adaptive partitions associated with $\mathcal{T}_0$ and $\mathcal{T}_1$ respectively, we denote by $\Lambda_0\vee\Lambda_1$ and $\Lambda_0\wedge\Lambda_1$ the partitions associated with the trees $\mathcal{T}_0\cup\mathcal{T}_1$ and $\mathcal{T}_0\cap\mathcal{T}_1$, respectively.

    ● Let $b = 2a_{\max}+5$, where $a_{\max}$ is the maximal number of children that a node has in $\mathcal{T}^n$.

    Inspired by the analysis of wavelet thresholding procedures [5,6], we split the error into four terms,

    $$\|f - \hat f_{\hat\Lambda^n(\tau_n)}\| \le e_1 + e_2 + e_3 + e_4,$$

    where

    $$\begin{aligned} e_1 &:= \|f - f_{\hat\Lambda^n(\tau_n)\vee\Lambda^n(b\tau_n)}\|,\\ e_2 &:= \|f_{\hat\Lambda^n(\tau_n)\vee\Lambda^n(b\tau_n)} - f_{\hat\Lambda^n(\tau_n)\vee\Lambda^n(\tau_n/b)}\|,\\ e_3 &:= \|f_{\hat\Lambda^n(\tau_n)\vee\Lambda^n(\tau_n/b)} - \hat f_{\hat\Lambda^n(\tau_n)\vee\Lambda^n(\tau_n/b)}\|,\\ e_4 &:= \|\hat f_{\hat\Lambda^n(\tau_n)\vee\Lambda^n(\tau_n/b)} - \hat f_{\hat\Lambda^n(\tau_n)}\|.\end{aligned}$$

    The goal of the splitting above is to handle the bias and variance separately, as well as to deal with the fact that the partition built from those $C_{j,k}$ such that $\hat\Delta_{j,k} \ge \tau_n$ does not coincide with the partition which would be chosen by an oracle based on those $C_{j,k}$ such that $\Delta_{j,k} \ge \tau_n$. This is accounted for by the terms $e_2$ and $e_4$, which correspond to those $C_{j,k}$ such that $\hat\Delta_{j,k}$ is significantly larger or smaller than $\Delta_{j,k}$ respectively, and which will be proved to be small in probability. The $e_1$ and $e_3$ terms correspond to the bias and variance of oracle estimators based on partitions obtained by thresholding the unknown oracle change in approximation $\Delta_{j,k}$.

    Since $\hat\Lambda^n(\tau_n)\vee\Lambda^n(b\tau_n)$ is a finer partition than $\Lambda^n(b\tau_n)$, we have

    $$e_1 \le \|f - f_{\Lambda^n(b\tau_n)}\| \le \|f - f_{\Lambda(b\tau_n)}\| + \|f_{\Lambda(b\tau_n)} - f_{\Lambda^n(b\tau_n)}\| =: e_{11} + e_{12}.$$

    The $e_{11}$ term is treated by a deterministic estimate based on the model class $\mathcal{B}_s$: by Lemma 2 we have

    $$e_{11}^2 \le B_{s,d}\,|f|^p_{\mathcal{B}_s}(b\kappa)^{2-p}(\log n/n)^{\frac{2s}{2s+d}}.$$

    The term $e_{12}$ accounts for the error on the cells that have not been explored by our training data, which is small:

    $$\mathbb{P}\{e_{12} > 0\} \le \mathbb{P}\{\exists\,C_{j,k}\in\mathcal{T}(b\tau_n)\setminus\mathcal{T}^n(b\tau_n)\} = \mathbb{P}\{\exists\,C_{j,k}\in\mathcal{T}(b\tau_n) : \Delta_{j,k} \ge b\tau_n\ \text{and}\ \hat\rho(C_{j,k}) < d/n\} \le \sum_{C_{j,k}\in\mathcal{T}(b\tau_n)}\mathbb{P}\{\Delta_{j,k} \ge b\tau_n\ \text{and}\ \hat\rho(C_{j,k}) < d/n\}.$$

    According to (4), we have $(\Delta_{j,k})^2 \le 4\|f\|_\infty^2\rho(C_{j,k})$. Then every $C_{j,k}$ with $\Delta_{j,k} \ge b\tau_n$ satisfies $\rho(C_{j,k}) \ge \frac{b^2\kappa^2}{\|f\|_\infty^2}\cdot\frac{\log n}{n}$. Hence, provided that $n$ satisfies $\frac{b^2\kappa^2}{\|f\|_\infty^2}\log n \ge 2d$, we have

    $$\mathbb{P}\{\Delta_{j,k} \ge b\tau_n\ \text{and}\ \hat\rho(C_{j,k}) < d/n\} \le \mathbb{P}\left\{|\rho(C_{j,k}) - \hat\rho(C_{j,k})| \ge \tfrac{1}{2}\rho(C_{j,k})\ \text{and}\ \rho(C_{j,k}) \ge \frac{b^2\kappa^2}{\|f\|_\infty^2}\cdot\frac{\log n}{n}\right\} \le 2n^{-\frac{3b^2\kappa^2}{28\|f\|_\infty^2}},$$

    where the last inequality follows from Lemma 3(b). Therefore, by Definition 3 we obtain

    $$\mathbb{P}\{e_{12} > 0\} \le \#\mathcal{T}(b\tau_n)\,n^{-\frac{3b^2\kappa^2}{28\|f\|_\infty^2}} \le |f|^p_{\mathcal{B}_s}(b\tau_n)^{-p}\,n^{-\frac{3b^2\kappa^2}{28\|f\|_\infty^2}} \le |f|^p_{\mathcal{B}_s}(b\kappa)^{-p}\,n^{-\left(\frac{3b^2\kappa^2}{28\|f\|_\infty^2}-1\right)} \le |f|^p_{\mathcal{B}_s}(b\kappa)^{-p}\,n^{-\nu}$$

    as long as $\frac{3b^2\kappa^2}{28\|f\|_\infty^2} - 1 > \nu$. To estimate $\mathbb{E}e_{12}^2$, we observe that, thanks to Lemma 6,

    $$e_{12}^2 = \Big\|\sum_{C_{j,k}\in\Lambda^n(b\tau_n)\setminus\Lambda(b\tau_n)}\ \sum_{\substack{C_{j',k'}\in\Lambda(b\tau_n)\\ C_{j',k'}\subseteq C_{j,k}}}(f_{j,k} - f_{j',k'})\mathbf{1}_{j',k'}\Big\|^2 \lesssim M^2.$$

    Hence, choosing $\nu = 1 > \frac{2s}{2s+d}$ we get

    $$\mathbb{E}e_{12}^2 \lesssim M^2\,\mathbb{P}\{e_{12} > 0\} \lesssim M^2\,|f|^p_{\mathcal{B}_s}(b\kappa)^{-p}(\log n/n)^{\frac{2s}{2s+d}}.$$

    The term $e_3$ is the variance term, which can be estimated by Proposition 2 with $\Lambda = \hat\Lambda^n(\tau_n)\vee\Lambda^n(\tau_n/b)$. We plug in $\eta = r(\log n/n)^{\frac{s}{2s+d}}$. Bounding $\#\Lambda$ by $\#\Lambda^n(\tau_n/b) \le \#\Lambda^n \le n/d$ (as our data master tree has at least $d$ points in each leaf) outside the exponential, and by $\#\Lambda^n(\tau_n/b) \le \#\Lambda(\tau_n/b) \le |f|^p_{\mathcal{B}_s}(\tau_n/b)^{-p}$ inside the exponential, we get the following estimates for $e_3$:

    $$\mathbb{P}\left\{e_3 > r\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} \le \begin{cases} C_0\,n^{1 - \frac{r^2\kappa^p}{c_0 b^p|f|^p_{\mathcal{B}_s}\max\{M^2,\sigma^2\}}} & (\ell = 0)\\[2mm] C_1\,n^{1 - \frac{r^2\kappa^p}{c_1 b^p|f|^p_{\mathcal{B}_s}\max\{d^4M^2,\,d^2\sigma^2\}}} & (\ell = 1),\end{cases}$$

    where $C_0 = C_0(\theta_2,\theta_3,a_{\max},d,s,|f|_{\mathcal{B}_s},\kappa)$ and $C_1 = C_1(\theta_2,\theta_3,a_{\max},d,s,|f|_{\mathcal{B}_s},\kappa)$. We obtain $\mathbb{P}\{e_3 > r(\log n/n)^{\frac{s}{2s+d}}\} \le Cn^{-\nu}$ as long as $r$ is chosen large enough to make the exponent smaller than $-\nu$.

    To estimate $\mathbb{E}e_3^2$, we apply again Proposition 2, now with $\#\Lambda \le |f|^p_{\mathcal{B}_s}(b/\kappa)^p(\log n/n)^{-\frac{d}{2s+d}}$, obtaining

    $$\mathbb{E}e_3^2 \le \bar C(\log n/n)^{\frac{2s}{2s+d}}.$$

    Next we estimate $e_2$ and $e_4$. Since $\hat{\mathcal{T}}^n(\tau_n)\cup\mathcal{T}^n(\tau_n/b) \supseteq \hat{\mathcal{T}}^n(\tau_n)\cup\mathcal{T}^n(b\tau_n)$ and $\mathcal{T}^n(b\tau_n) \subseteq \mathcal{T}^n(\tau_n/b)$, we have $e_2 > 0$ if and only if there is a $C_{j,k}\in\mathcal{T}^n$ such that either $C_{j,k}$ is in $\hat{\mathcal{T}}^n(\tau_n)$ but not in $\mathcal{T}^n(\tau_n/b)$, or $C_{j,k}$ is in $\mathcal{T}^n(b\tau_n)$ but not in $\hat{\mathcal{T}}^n(\tau_n)$. This means that either $\hat\Delta_{j,k} \ge \tau_n$ but $\Delta_{j,k} < \tau_n/b$, or $\Delta_{j,k} \ge b\tau_n$ but $\hat\Delta_{j,k} < \tau_n$. As a consequence,

    $$\mathbb{P}\{e_2 > 0\} \le \sum_{C_{j,k}\in\mathcal{T}^n}\mathbb{P}\{\hat\Delta_{j,k} \ge \tau_n\ \text{and}\ \Delta_{j,k} < \tau_n/b\} + \sum_{C_{j,k}\in\mathcal{T}^n}\mathbb{P}\{\Delta_{j,k} \ge b\tau_n\ \text{and}\ \hat\Delta_{j,k} < \tau_n\},$$

    and analogously

    $$\mathbb{P}\{e_4 > 0\} \le \sum_{C_{j,k}\in\mathcal{T}^n}\mathbb{P}\{\hat\Delta_{j,k} \ge \tau_n\ \text{and}\ \Delta_{j,k} < \tau_n/b\}.$$

    We can now apply Lemma 7: we use (b) with $\eta = \tau_n/b$, and (a) with $\eta = \tau_n$. We obtain that

    $$\mathbb{P}\{e_2 > 0\} + \mathbb{P}\{e_4 > 0\} \le \begin{cases} C(a_{\min},d)\,n^{1 - \frac{\kappa^2}{c_0 b^2\max\{M^2,\sigma^2\}}} & (\ell = 0)\\[2mm] C(\theta_2,\theta_3,a_{\min},d)\,n^{1 - \frac{\kappa^2}{c_1 b^2\max\{d^4M^2,\,d^2\sigma^2\}}} & (\ell = 1).\end{cases}$$

    We have $\mathbb{P}\{e_2 > 0\} + \mathbb{P}\{e_4 > 0\} \le Cn^{-\nu}$ provided that $\kappa$ is chosen such that the exponents are smaller than $-\nu$.

    We are left to deal with the expectations. As for $e_2$, Lemma 6 implies $e_2 \lesssim M$, which gives, for $\nu = 1 > \frac{2s}{2s+d}$,

    $$\mathbb{E}e_2^2 \lesssim M^2\,\mathbb{P}\{e_2 > 0\} \le CM^2(\log n/n)^{\frac{2s}{2s+d}}.$$

    The same bound holds for $e_4$, which concludes the proof of Theorem 2.

    This section contains the main concentration inequalities of the empirical quantities around their oracles. For piecewise linear estimators, some quantities used in Lemma 5 are decomposed in Table 4. All proofs are collected in Appendix A.

    Table 4.  Decomposition of piecewise linear estimators into quantities used in Lemma 5.
    oracle: $f_{j,k}(x) = T_M\big([(x - c_{j,k})^T\ \ 2^{-j}]\,Q_{j,k}\,r_{j,k}\big)$ | empirical counterpart: $\hat f_{j,k}(x) = T_M\big([(x - \hat c_{j,k})^T\ \ 2^{-j}]\,\hat Q_{j,k}\,\hat r_{j,k}\big)$
    oracle: $Q_{j,k} := \begin{bmatrix}[\Sigma_{j,k}]_d & 0\\ 0 & 2^{2j}\end{bmatrix}$ | empirical counterpart: $\hat Q_{j,k} := \begin{bmatrix}[\hat\Sigma_{j,k}]_d & 0\\ 0 & 2^{2j}\end{bmatrix}$
    oracle: $[\Sigma_{j,k}]_d := V_{j,k}[\Lambda_{j,k}]_d^{-1}V_{j,k}^T$ | empirical counterpart: $[\hat\Sigma_{j,k}]_d := \hat V_{j,k}[\hat\Lambda_{j,k}]_d^{-1}\hat V_{j,k}^T$
    oracle: $r_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} y\begin{bmatrix}x - c_{j,k}\\ 2^{-j}\end{bmatrix}d\rho$ | empirical counterpart: $\hat r_{j,k} := \frac{1}{\hat n_{j,k}}\sum_{x_i\in C_{j,k}} y_i\begin{bmatrix}x_i - \hat c_{j,k}\\ 2^{-j}\end{bmatrix}$


    Lemma 3. For every $t > 0$ we have:

    (a) $\mathbb{P}\{|\rho(C_{j,k}) - \hat\rho(C_{j,k})| > t\} \le 2\exp\left(-\dfrac{3nt^2}{6\rho(C_{j,k}) + 2t}\right)$;

    (b) setting $t = \frac{1}{2}\rho(C_{j,k})$ in (a) yields $\mathbb{P}\{|\rho(C_{j,k}) - \hat\rho(C_{j,k})| > \frac{1}{2}\rho(C_{j,k})\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right)$;

    (c) $\mathbb{P}\{\|c_{j,k} - \hat c_{j,k}\| > t\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + 8\exp\left(-\dfrac{3n\rho(C_{j,k})t^2}{12\theta_2^2 2^{-2j} + 4\theta_2 2^{-j}t}\right)$;

    (d) $\mathbb{P}\{\|\Sigma_{j,k} - \hat\Sigma_{j,k}\| > t\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + \left(\frac{4\theta_2^2}{\theta_3}d + 8\right)\exp\left(-\dfrac{3n\rho(C_{j,k})t^2}{96\theta_2^4 2^{-4j} + 16\theta_2^2 2^{-2j}t}\right)$.

    Lemma 4. We have:

    (a) $\mathbb{P}\left\{\|Q_{j,k} - \hat Q_{j,k}\| > \frac{48}{\theta_3^2}d^2\,2^{4j}\|\Sigma_{j,k} - \hat\Sigma_{j,k}\|\right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + \left(\frac{4\theta_2^2}{\theta_3}d + 10\right)\exp\left(-\dfrac{n\rho(C_{j,k})}{512(\theta_2^2/\theta_3)^2 d^2 + \frac{64}{3}(\theta_2^2/\theta_3)d}\right)$.

    (b) $\mathbb{P}\left\{\|\hat Q_{j,k}\| > \frac{2}{\theta_3}d\,2^{2j}\right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + \left(\frac{4\theta_2^2}{\theta_3}d + 10\right)\exp\left(-\dfrac{n\rho(C_{j,k})}{128(\theta_2^2/\theta_3)^2 d^2 + \frac{32}{3}(\theta_2^2/\theta_3)d}\right)$.

    (c) Suppose $f \in L^\infty$. For every $t > 0$, we have

    $$\mathbb{P}\{\|r_{j,k} - \hat r_{j,k}\| > t\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + 8\exp\left(-\dfrac{n\rho(C_{j,k})t^2}{4\theta_2^2\|f\|_\infty^2 2^{-2j} + 2\theta_2\|f\|_\infty 2^{-j}t}\right) + 2\exp\left(-\dfrac{c\,n\rho(C_{j,k})t^2}{\theta_2^2\|\zeta\|^2_{\psi_2}2^{-2j}}\right),$$

    where $c$ is an absolute constant.

    Lemma 5. Suppose $f \in L^\infty$. For every $t > 0$, we have

    $$\mathbb{P}\{\|f_{j,k} - \hat f_{j,k}\|_\infty > t\} \le \begin{cases} C_0\left[\exp\left(-\dfrac{n\rho(C_{j,k})}{c_0}\right) + \exp\left(-\dfrac{n\rho(C_{j,k})t^2}{c_0'(\|f\|_\infty^2 + \|f\|_\infty t)}\right) + \exp\left(-\dfrac{n\rho(C_{j,k})t^2}{c_0'\|\zeta\|^2_{\psi_2}}\right)\right] & \text{for } \ell = 0\\[3mm] C_1 d\left[\exp\left(-\dfrac{n\rho(C_{j,k})}{c_1 d^2}\right) + \exp\left(-\dfrac{n\rho(C_{j,k})t^2}{c_1 d^4(\|f\|_\infty^2 + \|f\|_\infty t)}\right) + \exp\left(-\dfrac{n\rho(C_{j,k})t^2}{c_1 d^2\|\zeta\|^2_{\psi_2}}\right)\right] & \text{for } \ell = 1,\end{cases}$$

    where $c_0, C_0$ are absolute constants, $c_0'$ depends on $\theta_2$, and $c_1, C_1$ depend on $\theta_2, \theta_3$.

    Lemma 6. Suppose $f \in L^\infty$. For every $C_{j,k}\in\mathcal{T}$ and $C_{j',k'} \subseteq C_{j,k}$,

    $$\|f_{j,k} - f_{j',k'}\|_\infty \le 2M, \qquad \|\hat f_{j,k} - \hat f_{j',k'}\|_\infty \le 2M.$$

    Lemma 7. Suppose $f \in L^\infty$. For every $\eta > 0$, we have

    (a) $\mathbb{P}\{\hat\Delta_{j,k} < \eta\ \text{and}\ \Delta_{j,k} \ge (2a_{\max}+5)\eta\} \le \begin{cases} C_0\exp\left(-\dfrac{n\eta^2}{c_0\max\{\|f\|_\infty^2,\|\zeta\|^2_{\psi_2}\}}\right) & \text{for } \ell = 0\\[2mm] C_1 d\exp\left(-\dfrac{n\eta^2}{c_1\max\{d^4\|f\|_\infty^2,\,d^2\|\zeta\|^2_{\psi_2}\}}\right) & \text{for } \ell = 1;\end{cases}$

    (b) $\mathbb{P}\{\Delta_{j,k} < \eta\ \text{and}\ \hat\Delta_{j,k} \ge (2a_{\max}+5)\eta\} \le \begin{cases} C_0\exp\left(-\dfrac{n\eta^2}{c_0\max\{\|f\|_\infty^2,\|\zeta\|^2_{\psi_2}\}}\right) & \text{for } \ell = 0\\[2mm] C_1 d\exp\left(-\dfrac{n\eta^2}{c_1\max\{d^4\|f\|_\infty^2,\,d^2\|\zeta\|^2_{\psi_2}\}}\right) & \text{for } \ell = 1;\end{cases}$

    $C_0, c_0$ depend on $a_{\max}$; $c_0$ depends also on $\theta_2$; $C_1, c_1$ depend on $a_{\max}, \theta_2, \theta_3$.

    This work was partially supported by NSF-DMS-125012, AFOSR FA9550-17-1-0280, NSF-IIS-1546392. The authors thank Duke University for donating computing equipment used for this project.

    The authors declare no conflict of interest.

    Example 1. Let $f \in C^{\ell,\alpha}$. The local estimator $f_{j,k}$ minimizes $\|(f - p)\mathbf{1}_{j,k}\|$ over all polynomials $p$ of order less than or equal to $\ell$. Thus, in particular, we have $\|(f - f_{j,k})\mathbf{1}_{j,k}\| \le \|(f - p)\mathbf{1}_{j,k}\|$ where $p$ is the $\ell$-order Taylor polynomial of $f$ at some $z \in C_{j,k}$. Hence, for $x \in C_{j,k}$ there is $\xi \in \mathcal{M}\cap B_{\theta_2 2^{-j}}(z)$ such that

    $$|f(x) - p(x)| \le \sum_{|\lambda|=\ell}\frac{1}{\lambda!}|\partial^\lambda f(\xi) - \partial^\lambda f(z)|\,|x-z|^\lambda \le |f|_{C^{\ell,\alpha}}\|\xi - z\|^\alpha\sum_{|\lambda|=\ell}\frac{1}{\lambda!}|x-z|^\lambda \le \frac{d^\ell}{\ell!}|f|_{C^{\ell,\alpha}}\|\xi - z\|^\alpha\|x - z\|^\ell \le \frac{\theta_2^{\ell+\alpha}d^\ell}{\ell!}|f|_{C^{\ell,\alpha}}2^{-j(\ell+\alpha)}.$$

    Therefore, for every $j$ and $k \in \mathcal{K}_j$, we have

    $$\|(f - f_{j,k})\mathbf{1}_{j,k}\|^2 \le \frac{\theta_2^{2(\ell+\alpha)}d^{2\ell}}{(\ell!)^2}|f|^2_{C^{\ell,\alpha}}2^{-2j(\ell+\alpha)}\rho(C_{j,k}).$$

    Examples 2 and 3. For polynomial estimators of any fixed order $\ell = 0,1,\dots$, $g_{j,k} - g\mathbf{1}_{j,k} = 0$ when $C_{j,k}\cap\Gamma = \emptyset$, and $\|g_{j,k} - g\mathbf{1}_{j,k}\|_\infty = O(1)$ when $C_{j,k}\cap\Gamma \ne \emptyset$. At scale $j$, $\rho(C_{j,k}) \asymp 2^{-jd}$ and $\rho\big(\bigcup\{C_{j,k} : C_{j,k}\cap\Gamma \ne \emptyset\}\big) \lesssim 2^{-j(d-d_\Gamma)}\rho(\Gamma)$. Therefore,

    $$\|g_{\Lambda_j} - g\| \le O\big(\sqrt{2^{-j(d-d_\Gamma)}}\big) = O\big(2^{-j(d-d_\Gamma)/2}\big),$$

    which implies $g \in \mathcal{A}_{(d-d_\Gamma)/2}$.

    In adaptive approximations, $\Delta_{j,k} = 0$ when $C_{j,k}\cap\Gamma = \emptyset$. When $C_{j,k}\cap\Gamma \ne \emptyset$, $\Delta_{j,k} = \big\|g_{j,k} - \sum_{C_{j+1,k'}\subseteq C_{j,k}}g_{j+1,k'}\big\| \lesssim \sqrt{\rho(C_{j,k})} \asymp 2^{-jd/2}$. Given any fixed threshold $\tau > 0$, in the truncated tree $\mathcal{T}(\tau)$ the leaf nodes intersecting $\Gamma$ satisfy $2^{-jd/2} \gtrsim \tau$. In other words, around $\Gamma$ the tree is truncated at a scale coarser than $j$ such that $2^{-j} = O(\tau^{2/d})$. The cardinality of $\mathcal{T}(\tau)$ is dominated by the nodes intersecting $\Gamma$, so

    $$\#\mathcal{T}(\tau) \lesssim \rho(\Gamma)\frac{2^{-j(d-d_\Gamma)}}{2^{-jd}} = \rho(\Gamma)\,2^{jd_\Gamma} \lesssim \tau^{-\frac{2d_\Gamma}{d}},$$

    which implies $p = 2d_\Gamma/d$. We conclude that $g \in \mathcal{B}_s$ with $s = \frac{d(2-p)}{2p} = \frac{d}{d_\Gamma}\cdot\frac{d-d_\Gamma}{2}$.

    Lemma 1. By definition, we have $\|(f - f_{j,k})\mathbf{1}_{j,k}\| \le |f|_{\mathcal{A}_{\infty,s}}2^{-js}\sqrt{\rho(C_{j,k})}$ as long as $f \in \mathcal{A}_{\infty,s}$. By splitting $(\Delta_{j,k})^2 \le 2\|(f - f_{j,k})\mathbf{1}_{j,k}\|^2 + 2\sum_{k' : C_{j+1,k'}\subseteq C_{j,k}}\|(f - f_{j+1,k'})\mathbf{1}_{j+1,k'}\|^2$, we get

    $$(\Delta_{j,k})^2 \le 4|f|^2_{\mathcal{A}_{\infty,s}}2^{-2js}\rho(C_{j,k}).$$

    In the selection of adaptive partitions, every $C_{j,k}$ with $\Delta_{j,k} \ge \tau$ must satisfy $\rho(C_{j,k}) \gtrsim 2^{2js}(\tau/|f|_{\mathcal{A}_{\infty,s}})^2$. With the extra assumption $\rho(C_{j,k}) \le \theta_0 2^{-jd}$ (true when the measure $\rho$ is doubling), we have

    $$\Delta_{j,k} \ge \tau \ \Longrightarrow\ 2^{-j} \gtrsim \left(\frac{\tau}{|f|_{\mathcal{A}_{\infty,s}}}\right)^{\frac{2}{2s+d}}. \tag{3}$$

    Therefore, every cell in $\Lambda(\tau)$ will be at a scale coarser than the finest $j$ satisfying (3). Using (A3) we thus get

    $$\tau^p\,\#\mathcal{T}(\tau) \le \tau^p a_{\min}\,\#\Lambda_j \le \theta_1^{-1}\tau^p a_{\min}2^{jd} \lesssim \frac{a_{\min}}{\theta_1}|f|_{\mathcal{A}_{\infty,s}}^{\frac{2d}{2s+d}},$$

    which yields the result.

    Lemma 2. For any partition $\Lambda \subseteq \mathcal{T}$, denote by $\Lambda^l$ the $l$-th generation refinement of $\Lambda$, so that $\Lambda^0 = \Lambda$ and $\Lambda^{l+1}$ consists of the children of the cells in $\Lambda^l$. We first prove that $\lim_{l\to\infty} f_{\Lambda^l} = f$ in $L^2(\mathcal{M})$. Suppose $f \in L^\infty$. Notice that $\|f_{\Lambda^l} - f\| \le \|f^0_{\Lambda^l} - f\|$. As a consequence of the Lebesgue differentiation theorem, $f^0_{\Lambda^l} \to f$ almost everywhere. Since $f$ is bounded, $f^0_{\Lambda^l}$ is uniformly bounded, hence $f^0_{\Lambda^l} \to f$ in $L^2(\mathcal{M})$ by the dominated convergence theorem. In the case where $f \in \mathcal{A}_t$, taking the uniform partition $\Lambda_{j(l)}$ at the coarsest scale of $\Lambda^l$, denoted by $j(l)$, we have $\|f - f_{\Lambda^l}\| \le \|f - f_{\Lambda_{j(l)}}\| \lesssim 2^{-j(l)t}$, and therefore $f_{\Lambda^l} \to f$ in $L^2(\mathcal{M})$.

    Now, setting $\Lambda = \Lambda(\tau)$ and $\mathcal{S} := \mathcal{T}(\tau)\setminus\Lambda$, by Definitions 3 and 4 we get

    $$\begin{aligned}\|f_\Lambda - f\|^2 &= \lim_{L\to\infty}\Big\|\sum_{l=0}^{L-1}(f_{\Lambda^l} - f_{\Lambda^{l+1}}) + f_{\Lambda^L} - f\Big\|^2 = \Big\|\sum_{l=0}^{\infty}(f_{\Lambda^l} - f_{\Lambda^{l+1}})\Big\|^2 = \Big\|\sum_{C_{j,k}\in\mathcal{T}\setminus\mathcal{S}}W_{j,k}\Big\|^2 \le B_0\sum_{C_{j,k}\in\mathcal{T}\setminus\mathcal{S}}\|W_{j,k}\|^2\\ &= B_0\sum_{l=0}^{\infty}\ \sum_{\Delta_{j,k}\in[2^{-(l+1)}\tau,\,2^{-l}\tau)}\Delta_{j,k}^2 \le B_0\sum_{l=0}^{\infty}2^{-2l}\tau^2\,\#\mathcal{T}(2^{-(l+1)}\tau) \le B_0 2^p\,|f|^p_{\mathcal{B}_s}\,\tau^{2-p}\sum_{l=0}^{\infty}2^{-(2-p)l},\end{aligned}$$

    which yields the first inequality in Lemma 2. The second inequality follows by observing that $2 - p = \frac{2s}{d}p$ and $|f|^p_{\mathcal{B}_s}\tau^{2-p} = |f|^2_{\mathcal{B}_s}\big(|f|^p_{\mathcal{B}_s}\tau^{-p}\big)^{-\frac{2s}{d}} \le |f|^2_{\mathcal{B}_s}\,\#[\mathcal{T}(\tau)]^{-\frac{2s}{d}}$ by Definition 3.

    Lemma 3. See [34].

    Lemma 4. (a) Thanks to [23, Theorem 3.2] and assumption (A5), we have

    $$\|Q_{j,k} - \hat Q_{j,k}\| = \|[\Sigma_{j,k}]_d - [\hat\Sigma_{j,k}]_d\| \le \frac{3\|\Sigma_{j,k} - \hat\Sigma_{j,k}\|}{\big(\lambda_d^{j,k} - \lambda_{d+1}^{j,k} - \|\Sigma_{j,k} - \hat\Sigma_{j,k}\|\big)^2} \le \frac{3\|\Sigma_{j,k} - \hat\Sigma_{j,k}\|}{\big(\frac{\theta_3}{2d}2^{-2j} - \|\Sigma_{j,k} - \hat\Sigma_{j,k}\|\big)^2}.$$

    Hence, the bound follows applying Lemma 3(d) with $t = \frac{\theta_3}{4d}2^{-2j}$.

    (b) Observe that $\|\hat Q_{j,k}\| \le \|[\hat\Sigma_{j,k}]_d\| = (\hat\lambda_d^{j,k})^{-1}$. Moreover, $\hat\lambda_d^{j,k} \ge \lambda_d^{j,k} - |\lambda_d^{j,k} - \hat\lambda_d^{j,k}| \ge \frac{\theta_3}{d}2^{-2j} - \|\Sigma_{j,k} - \hat\Sigma_{j,k}\|$ by assumption (A5). Thus, using Lemma 3(d) with $t = \frac{\theta_3}{2d}2^{-2j}$ yields the result.

    (c) We condition on the event that $\hat n_{j,k} \ge \frac{1}{2}\mathbb{E}\hat n_{j,k} = \frac{1}{2}n\rho(C_{j,k})$, whose complement occurs with probability at most $2\exp(-\frac{3}{28}n\rho(C_{j,k}))$ by Lemma 3(b). The quantity $\|r_{j,k} - \hat r_{j,k}\|$ is bounded by $A + B + C + D$ with

    $$\begin{aligned} A &:= \Big\|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n\Big(f(x_i)\begin{bmatrix}x_i - c_{j,k}\\ 2^{-j}\end{bmatrix} - \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}}f(x)\begin{bmatrix}x - c_{j,k}\\ 2^{-j}\end{bmatrix}d\rho\Big)\mathbf{1}_{j,k}(x_i)\Big\|,\\ B &:= \Big\|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n f(x_i)\begin{bmatrix}c_{j,k} - \hat c_{j,k}\\ 0\end{bmatrix}\mathbf{1}_{j,k}(x_i)\Big\|,\\ C &:= \Big\|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n \zeta_i\begin{bmatrix}x_i - c_{j,k}\\ 2^{-j}\end{bmatrix}\mathbf{1}_{j,k}(x_i)\Big\|,\\ D &:= \Big\|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n \zeta_i\begin{bmatrix}c_{j,k} - \hat c_{j,k}\\ 0\end{bmatrix}\mathbf{1}_{j,k}(x_i)\Big\|. \end{aligned}$$

    Each term of the sum in $A$ has expectation $0$ and norm bounded by $2\theta_2\|f\|_\infty 2^{-j}$. Thus, applying the Bernstein inequality [44, Corollary 7.3.2] we obtain

    $$\mathbb{P}\{A > t\} \le 8\exp\left(-\dfrac{c\,n\rho(C_{j,k})t^2}{\theta_2^2\|f\|_\infty^2 2^{-2j} + \theta_2\|f\|_\infty 2^{-j}t}\right).$$

    $B$ is bounded by $\|f\|_\infty\|c_{j,k} - \hat c_{j,k}\|$ so that, using Lemma 3(c) with $t$ replaced by $t/\|f\|_\infty$, we get

    $$\mathbb{P}\{B > t\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + 8\exp\left(-\dfrac{c\,n\rho(C_{j,k})t^2}{\theta_2^2\|f\|_\infty^2 2^{-2j} + \theta_2\|f\|_\infty 2^{-j}t}\right).$$

    To estimate $C$ we appeal to [7, Theorem 3.1, Remark 4.2]. For $X \in \mathbb{R}^n$, take $G(X) := \|MX\|$ with $M := \begin{bmatrix}x_1 - c_{j,k} & \cdots & x_n - c_{j,k}\\ 2^{-j} & \cdots & 2^{-j}\end{bmatrix}$. Then $|\partial_i G(X)| \le \|x_i - c_{j,k}\| \le \theta_2 2^{-j}$. Now let $X = (\zeta_1\mathbf{1}_{j,k}(x_1),\dots,\zeta_n\mathbf{1}_{j,k}(x_n))^T$, so that $C = G(X)/\hat n_{j,k}$. Since the $\zeta_i$'s are independent, [7, Remark 4.2] applies, and it yields $\mathbb{P}\{G(X) > t\} \le 2\exp\big(-\frac{t^2}{2\bar\sigma^2}\big)$, where $\bar\sigma^2 = \sum_{i=1}^n\|\partial_i G\|^2\|\zeta_i\|^2_{\psi_2}\mathbf{1}_{j,k}(x_i) \le \hat n_{j,k}\theta_2^2 2^{-2j}\|\zeta\|^2_{\psi_2}$, and thus

    $$\mathbb{P}\{C > t\} \le 2\exp\left(-\dfrac{n\rho(C_{j,k})t^2}{2\theta_2^2\|\zeta\|^2_{\psi_2}2^{-2j}}\right).$$

    We are left with $D$. This term is smaller than $\|c_{j,k} - \hat c_{j,k}\|\,\big|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n\zeta_i\mathbf{1}_{j,k}(x_i)\big|$, where, by Lemma 3(c), $\|c_{j,k} - \hat c_{j,k}\| \le \theta_2 2^{-j}$ with probability at least $1 - 10\exp(-\frac{3}{28}n\rho(C_{j,k}))$. Hence, by the standard sub-Gaussian tail inequality [46, Proposition 5.10] we have

    $$\mathbb{P}\{D > t\} \le 10\exp\left(-\frac{3}{28}n\rho(C_{j,k})\right) + e\exp\left(-\dfrac{c\,n\rho(C_{j,k})t^2}{\theta_2^2\|\zeta\|^2_{\psi_2}2^{-2j}}\right).$$

    This completes the proof.

    Lemma 5. If $\ell = 0$, then $\|f_{j,k} - \hat f_{j,k}\|_\infty \le |\bar y_{j,k} - \hat y_{j,k}|$, which is at most

    $$\Big|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n\Big(f(x_i) - \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}}f(x)\,d\rho(x)\Big)\mathbf{1}_{j,k}(x_i)\Big| + \Big|\frac{1}{\hat n_{j,k}}\sum_{i=1}^n\zeta_i\mathbf{1}_{j,k}(x_i)\Big|.$$

    Each addend in the first term has expectation $0$ and is bounded by $2\|f\|_\infty$, and therefore we can apply the standard Bernstein inequality [44, Theorem 1.6.1]. As for the second term, we use the standard sub-Gaussian tail inequality [46, Proposition 5.10]. This yields the bounds for $\ell = 0$.

    For $\ell = 1$, we have

    $$\begin{aligned} |f_{j,k}(x) - \hat f_{j,k}(x)| &\le \big|[(x - c_{j,k})^T\ \ 2^{-j}]\,Q_{j,k}r_{j,k} - [(x - \hat c_{j,k})^T\ \ 2^{-j}]\,\hat Q_{j,k}\hat r_{j,k}\big|\\ &\le \big|[(c_{j,k} - \hat c_{j,k})^T\ \ 0]\,Q_{j,k}r_{j,k}\big| + \big|[(x - \hat c_{j,k})^T\ \ 2^{-j}]\,(Q_{j,k}r_{j,k} - \hat Q_{j,k}\hat r_{j,k})\big|\\ &\le \|c_{j,k} - \hat c_{j,k}\|\,\|Q_{j,k}\|\,\|r_{j,k}\| + \big\|[(x - \hat c_{j,k})^T\ \ 2^{-j}]\big\|\,\|Q_{j,k}r_{j,k} - \hat Q_{j,k}\hat r_{j,k}\|\\ &\le \|c_{j,k} - \hat c_{j,k}\|\,\|Q_{j,k}\|\,\|r_{j,k}\| + \big\|[(x - \hat c_{j,k})^T\ \ 2^{-j}]\big\|\,\big(\|Q_{j,k} - \hat Q_{j,k}\|\,\|r_{j,k}\| + \|\hat Q_{j,k}\|\,\|r_{j,k} - \hat r_{j,k}\|\big)\\ &\lesssim \frac{\theta_2^2}{\theta_3}\Big(d\,\|f\|_\infty\,2^{j}\,\|c_{j,k} - \hat c_{j,k}\| + d^2 M\,2^{2j}\,\|\Sigma_{j,k} - \hat\Sigma_{j,k}\| + d\,2^{j}\,\|r_{j,k} - \hat r_{j,k}\|\Big),\end{aligned}$$

    where the last inequality holds with high probability thanks to Lemma 4(a)(b). Thus, applying Lemma 3(c)(d) and Lemma 4(c) with $t$ replaced by $\frac{t}{\theta dM}2^{-j}$, $\frac{t}{\theta d^2 M}2^{-2j}$ and $\frac{t}{\theta d}2^{-j}$ respectively, we obtain the desired result.

    Lemma 6. Follows simply by truncation.

    Lemma 7. We start with (a). Defining $\bar\Delta_{j,k} := \|W_{j,k}\|_n$, we have

    $$\begin{aligned}\mathbb{P}\{\hat\Delta_{j,k} < \eta\ \text{and}\ \Delta_{j,k} \ge (2a_{\max}+5)\eta\} &\le \mathbb{P}\{\hat\Delta_{j,k} < \eta\ \text{and}\ \bar\Delta_{j,k} \ge (a_{\max}+2)\eta\} + \mathbb{P}\{\bar\Delta_{j,k} < (a_{\max}+2)\eta\ \text{and}\ \Delta_{j,k} \ge (2a_{\max}+5)\eta\}\\ &\le \mathbb{P}\{|\bar\Delta_{j,k} - \hat\Delta_{j,k}| \ge (1+a_{\max})\eta\} + \mathbb{P}\{|\Delta_{j,k} - 2\bar\Delta_{j,k}| \ge \eta\}.\end{aligned}$$

    The first quantity can be bounded by using

    $$|\bar\Delta_{j,k} - \hat\Delta_{j,k}| \le \|W_{j,k} - \hat W_{j,k}\|_n \le \|f_{j,k} - \hat f_{j,k}\|_n + \sum_{C_{j+1,k'}\subseteq C_{j,k}}\|f_{j+1,k'} - \hat f_{j+1,k'}\|_n \le \|f_{j,k} - \hat f_{j,k}\|_\infty\sqrt{\hat\rho(C_{j,k})} + \sum_{C_{j+1,k'}\subseteq C_{j,k}}\|f_{j+1,k'} - \hat f_{j+1,k'}\|_\infty\sqrt{\hat\rho(C_{j+1,k'})},$$

    so that

    $$\mathbb{P}\{|\bar\Delta_{j,k} - \hat\Delta_{j,k}| \ge (1+a_{\max})\eta\} \le \mathbb{P}\big\{\|f_{j,k} - \hat f_{j,k}\|_\infty\sqrt{\hat\rho(C_{j,k})} \ge \eta\big\} + \sum_{C_{j+1,k'}\subseteq C_{j,k}}\mathbb{P}\big\{\|f_{j+1,k'} - \hat f_{j+1,k'}\|_\infty\sqrt{\hat\rho(C_{j+1,k'})} \ge \eta\big\}.$$

    We now condition on the event that $|\rho(C_{j,k}) - \hat\rho(C_{j,k})| \le \frac{1}{2}\rho(C_{j,k})$, which entails $\hat\rho(C_{j,k}) \le \frac{3}{2}\rho(C_{j,k})$, and apply Lemma 5 with $t \asymp \eta/\sqrt{\rho(C_{j,k})}$. The probability of the complementary event is bounded by Lemma 3(b). To get rid of the remaining $\rho(C_{j,k})$'s inside the exponentials, we lower bound $\rho(C_{j,k})$ as follows. We have

    $$(\Delta_{j,k})^2 \le 4\|f\|_\infty^2\,\rho(C_{j,k}). \tag{4}$$

    Thus, $\Delta_{j,k} \ge (2a_{\max}+5)\eta$ implies $\rho(C_{j,k}) \ge \frac{(2a_{\max}+5)^2\eta^2}{4\|f\|_\infty^2}$. Therefore, we obtain that

    $$\mathbb{P}\{|\bar\Delta_{j,k} - \hat\Delta_{j,k}| \ge (1+a_{\max})\eta\} \le \begin{cases} C_0\exp\left(-\dfrac{n\eta^2}{c_0\max\{\|f\|_\infty^2,\|\zeta\|^2_{\psi_2}\}}\right) & (\ell = 0)\\[2mm] C_1 d\exp\left(-\dfrac{n\eta^2}{c_1\max\{d^4\|f\|_\infty^2,\,d^2\|\zeta\|^2_{\psi_2}\}}\right) & (\ell = 1),\end{cases}$$

    where $C_0, c_0$ depend on $a_{\max}$, and $C_1, c_1$ depend on $a_{\max}, \theta_2, \theta_3$.

    Next we estimate $\mathbb{P}\{\Delta_{j,k} - 2\bar\Delta_{j,k} \ge \eta\}$ by [21, Theorem 11.2]. Notice that for all $x \in \mathcal{M}$, $|W_{j,k}(x)| \lesssim \|f\|_\infty$. If $x \notin C_{j,k}$, then $W_{j,k}(x) = 0$; otherwise there is $k'$ such that $x \in C_{j+1,k'} \subseteq C_{j,k}$. In such a case, $|W_{j,k}(x)| = |f_{j,k}(x) - f_{j+1,k'}(x)|$, and the claim follows from Lemma 6. Thus, [21, Theorem 11.2] gives us

    $$\mathbb{P}\{\Delta_{j,k} - 2\bar\Delta_{j,k} \ge \eta\} \le \exp\left(-\dfrac{n\eta^2}{c\|f\|_\infty^2}\right),$$

    where $c$ is an absolute constant.

    Let us turn to (b). We first observe that

    $$\hat\Delta_{j,k} \le M\sqrt{\hat\rho(C_{j,k})}. \tag{5}$$

    To see this, note again that $\hat W_{j,k}(x) \ne 0$ only when $x \in C_{j+1,k'} \subseteq C_{j,k}$ for some $k'$, in which case $|\hat W_{j,k}(x)| = |\hat f_{j,k}(x) - \hat f_{j+1,k'}(x)|$, and we can apply Lemma 6. Now recall that $b = 2a_{\max}+5$. We have

    $$\mathbb{P}\{\Delta_{j,k} < \eta\ \text{and}\ \hat\Delta_{j,k} \ge b\eta\} \le \mathbb{P}\left\{\Delta_{j,k} < \eta,\ \hat\Delta_{j,k} \ge b\eta\ \text{and}\ \rho(C_{j,k}) \ge \frac{b^2\eta^2}{2\|f\|_\infty^2}\right\} + \mathbb{P}\left\{\rho(C_{j,k}) < \frac{b^2\eta^2}{2\|f\|_\infty^2}\ \text{and}\ \hat\rho(C_{j,k}) \ge \frac{b^2\eta^2}{\|f\|_\infty^2}\right\} + \mathbb{P}\left\{\hat\rho(C_{j,k}) < \frac{b^2\eta^2}{\|f\|_\infty^2}\ \text{given}\ \hat\Delta_{j,k} \ge b\eta\right\}.$$

    The first probability can be estimated similarly to how we did for (a). Thanks to Lemma 3(a), the second probability is bounded by

    $$\mathbb{P}\left\{\rho(C_{j,k}) < \frac{b^2\eta^2}{2\|f\|_\infty^2}\ \text{and}\ |\hat\rho(C_{j,k}) - \rho(C_{j,k})| > \frac{b^2\eta^2}{2\|f\|_\infty^2}\right\} \le \exp\left(-\dfrac{b^2 n\eta^2}{c\|f\|_\infty^2}\right)$$

    for an absolute constant c. Finally, the third probability is zero thanks to (5).



    [1] W. K. Allard, G. Chen, M. Maggioni, Multi-scale geometric methods for data sets II: geometric multi-resolution analysis, Appl. Comput. Harmon. Anal., 32 (2012), 435-462. doi: 10.1016/j.acha.2011.08.001
    [2] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput., 15 (2003), 1373-1396. doi: 10.1162/089976603321780317
    [3] A. Beygelzimer, S. Kakade, J. Langford, Cover trees for nearest neighbor, In: Proceedings of the 23rd international conference on Machine learning, 2006, 97-104.
    [4] P. J. Bickel, B. Li, Local polynomial regression on unknown manifolds, Lecture Notes-Monograph Series, 54 (2007), 177-186.
    [5] P. Binev, A. Cohen, W. Dahmen, R. A. DeVore, Universal algorithms for learning theory part II: Piecewise polynomial functions, Constr. Approx., 26 (2007), 127-152.
    [6] P. Binev, A. Cohen, W. Dahmen, R. A. DeVore, V. N. Temlyakov, Universal algorithms for learning theory part I: Piecewise constant functions, J. Mach. Learn. Res., 6 (2005), 1297-1321.
    [7] V. Buldygin, E. Pechuk, Inequalities for the distributions of functionals of sub-Gaussian vectors, Theor. Probability and Math. Statist., 80 (2010), 25-36.
    [8] G. Chen, G. Lerman, Spectral Curvature Clustering (SCC), Int. J. Comput. Vis., 81 (2009), 317-330.
    [9] G. Chen, M. Maggioni, Multiscale geometric and spectral analysis of plane arrangements, In: IEEE Conference on Computer Vision and Pattern Recognition, 2011, 2825-2832.
    [10] M. Christ, A T(b) theorem with remarks on analytic capacity and the Cauchy integral, Colloq. Math., 60/61 (1990), 601-628.
    [11] A. Cohen, W. Dahmen, I. Daubechies, R. A. DeVore, Tree approximation and optimal encoding, Appl. Comput. Harmon. Anal., 11 (2001), 192-226. doi: 10.1006/acha.2001.0336
    [12] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, et al., Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, PNAS, 102 (2005), 7426-7431.
    [13] I. Daubechies, Ten lectures on wavelets, SIAM, 1992.
    [14] D. Deng, Y. Han, Harmonic analysis on spaces of homogeneous type, Springer, 2008.
    [15] D. L. Donoho, C. Grimes, Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, PNAS, 100 (2003), 5591-5596.
    [16] D. L. Donoho, J. M. Johnstone, Ideal spatial adaptation by wavelet shrinkage, Biometrika, 81 (1994), 425-455.
    [17] D. L. Donoho, J. M. Johnstone, Adapting to unknown smoothness via wavelet shrinkage, J. Am. Stat. Assoc., 90 (1995), 1200-1224.
    [18] E. Elhamifar, R. Vidal, Sparse subspace clustering, In: IEEE Conference on Computer Vision and Pattern Recognition, 2009, 2790-2797.
    [19] H. Federer, Curvature measures, T. Am. Math. Soc., 93 (1959), 418-491.
    [20] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Springer, 2001.
    [21] L. Györfi, M. Kohler, A. Krzyżak, H. Walk, A distribution-free theory of nonparametric regression, Springer, 2002.
    [22] N. Halko, P. G. Martinsson, J. A. Tropp, Finding structure with randomness: stochastic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), 217-288.
    [23] P. C. Hansen, The truncated SVD as a method for regularization, Bit Numer. Math., 27 (1987), 534-553.
    [24] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24 (1933), 417-441.
    [25] H. Hotelling, Relations between two sets of variates, Biometrika, 28 (1936), 321-377.
    [26] I. T. Jolliffe, A note on the use of principal components in regression, J. C. Stat. Soc. C. Appl., 31 (1982), 300-303.
    [27] G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1999), 359-392.
    [28] T. Klock, A. Lanteri, S. Vigogna, Estimating multi-index models with response-conditional least squares, Electron. J. Stat., 15 (2021), 589-629.
    [29] S. Kpotufe, k-NN regression adapts to local intrinsic dimension, In: Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011,729-737.
    [30] S. Kpotufe, S. Dasgupta, A tree-based regressor that adapts to intrinsic dimension, J. Comput. Syst. Sci., 78 (2012), 1496-1515.
    [31] S. Kpotufe, V. K. Garg, Adaptivity to local smoothness and dimension in kernel regression, In: Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013, 3075-3083.
    [32] A. Lanteri, M. Maggioni, S. Vigogna, Conditional regression for single-index models, 2020 arXiv: 2002.10008.
    [33] A. B. Lee, R. Izbicki, A spectral series approach to high-dimensional nonparametric regression, Electron. J. Stat., 10 (2016), 423-463.
    [34] W. Liao, M. Maggioni, Adaptive geometric multiscale approximations for intrinsically low-dimensional data, J. Mach. Learn. Res., 20 (2019), 1-63.
    [35] W. Liao, M. Maggioni, S. Vigogna, Learning adaptive multiscale approximations to data and functions near low-dimensional sets, In: IEEE Information Theory Workshop (ITW), 2016,226-230.
    [36] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, In: Proceedings of the 26 th International Conference on Machine Learning, 2010,663-670.
    [37] M. Maggioni, S. Minsker, N. Strawn, Multiscale dictionary learning: Non-asymptotic bounds and robustness, J. Mach. Learn. Res., 17 (2016), 1-51.
    [38] S. Mallat, A wavelet tour of signal processing, 2 Eds., Academic Press, 1999.
    [39] K. Pearson, On lines and planes of closest fit to systems of points in space, Philos. Mag., 2 (1901), 559-572.
    [40] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290 (2000), 2323-2326.
    [41] I. Steinwart, D. R. Hush, C. Scovel, Optimal rates for regularized least squares regression, In: The 22nd Annual Conference on Learning Theory, 2009.
    [42] A. Szlam, Asymptotic regularity of subdivisions of euclidean domains by iterated PCA and iterated 2-means, Appl. Comput. Harmon. Anal., 27 (2009), 342-350.
    [43] J. B. Tenenbaum, V. D. Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, 290 (2000), 2319-2323.
    [44] J. A. Tropp, User-friendly tools for random matrices: An introduction, NIPS version, 2012.
    [45] A. B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
    [46] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, In: Compressed sensing, Cambridge University Press, 2012,210-268.
    [47] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (GPCA), IEEE T. Pattern Anal., 27 (2005), 1945-1959.
    [48] G. B. Ye, D. X. Zhou, Learning and approximation by Gaussians on Riemannian manifolds, Adv. Comput. Math., 29 (2008), 291-310. doi: 10.1007/s10444-007-9049-0
    [49] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimension reduction via local tangent space alignment, SIAM J. Sci. Comput., 26 (2002), 313-338.
    [50] X. Zhou, N. Srebro, Error analysis of Laplacian eigenmaps for semi-supervised learning, In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011,901-908.
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)