Understanding Additive Models and the Curse of Dimensionality

Curse of Dimensionality

7.0 Additive Models; 7.1 Curse of Dimensionality (term due to Richard Bellman): the problem of estimating f becomes vastly harder as p, the dimension of x, increases.

Nonparametric regression model: yi = f(xi) + ei

Minor assumptions:

  1. The xi are measured without error.
  2. The ei are independently and identically distributed (i.i.d.) with mean 0.
  3. The ei have common variance σ², an unknown constant.

Common assumptions:

  1. f is in a Sobolev space (functions with bounded derivatives; one standard formulation is displayed after this list).
  2. f has a bounded number of discontinuities.
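
One standard way to make assumption 1 precise (an illustrative formulation, written for a one-dimensional argument) is to require f to lie in a Sobolev ball of order m,

    { f : ∫ (f^(m)(x))^2 dx ≤ C },

for a known smoothness order m and constant C. The classical minimax rate for estimating such an f from n observations in p dimensions is n^(-2m/(2m+p)) in squared error, which deteriorates rapidly as p grows; this is one formal statement of the COD.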

Methods designed to mitigate the COD: MARS, CART, PPR, Loess, Random Forests, and Support Vector Machines.

The COD applies to all multivariate analyses that do not impose strong modeling assumptions; it bears with equal force on classification, cluster analysis, and multidimensional scaling.

COD descriptions:

  1. For fixed n, as p increases, the data become sparse (illustrated in the sketch after this list).
  2. As p increases, the number of possible models explodes: counting an intercept, p linear terms, p quadratic terms, and the C(p,2) pairwise interactions gives 2^(1 + 2p + C(p,2)) − 1 candidate models, which grows superexponentially in p; no realistic sample lets the data discriminate among them.
  3. For large p, most datasets are multicollinear (the explanatory values concentrate on an affine subspace of R^p, so the predictive value of the fitted model breaks down quickly as one moves away from that subspace) or exhibit concurvity (the nonparametric generalization of multicollinearity: the data concentrate on some smooth manifold within R^p, where "smooth" means the smoother used in backfitting can interpolate all the xi perfectly).
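
A small numerical sketch of descriptions 1 and 2 (the sample size, the dimensions, and the uniform design below are arbitrary illustrative choices):

    import numpy as np
    from math import comb

    rng = np.random.default_rng(0)
    n = 100  # fixed sample size

    # Description 1: with n fixed, nearest-neighbor distances grow with p (sparsity).
    for p in (1, 2, 5, 10, 50):
        X = rng.uniform(size=(n, p))  # n points in the unit cube [0, 1]^p
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
        np.fill_diagonal(d, np.inf)
        print(f"p={p:3d}  mean nearest-neighbor distance = {d.min(axis=1).mean():.3f}")

    # Description 2: model count, reading the exponent 1 + 2p + C(p,2) as
    # intercept + p linear terms + p quadratic terms + C(p,2) pairwise interactions.
    for p in (2, 5, 10):
        n_terms = 1 + 2 * p + comb(p, 2)
        print(f"p={p:3d}  candidate terms = {n_terms:3d}  possible models = 2^{n_terms} - 1")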

Results claiming to evade the COD:

  1. Barron 1994: neural networks avoid the COD.
  2. Zhao/Atkeson 1991: PPR evades the COD.
  3. Wozniakowski 1991: a modification of Hammersley points dodges the COD in multivariate integration.

Goals

Multiple linear regression model: Y = β0 + β1X1 + … + βpXp + e, with E[e] = 0, Var(e) = σ², and the ei i.i.d. and independent of x1, …, xp.

Useful because:

  1. Interpretable (each xj enters through a single coefficient βj).
  2. Theory supports inference and prediction easily.
  3. Simple interactions and transformations are easy.
  4. Dummy variables allow use of categorical information.
  5. Computation is fast.
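
A minimal numpy sketch of this model fit by least squares (simulated data; the coefficient values and the two-level categorical variable are made-up illustrations of points 1, 4, and 5):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    group = rng.integers(0, 2, size=n)        # two-level categorical predictor
    dummy = (group == 1).astype(float)        # dummy (indicator) coding

    # True model: Y = 1.0 + 2.0*x1 - 0.5*x2 + 3.0*dummy + e
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + 3.0 * dummy + rng.normal(scale=0.5, size=n)

    X = np.column_stack([np.ones(n), x1, x2, dummy])   # design matrix with intercept
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
    print("estimated coefficients:", np.round(beta_hat, 2))  # close to [1.0, 2.0, -0.5, 3.0]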

Additive Models

Expresses the response variable Y as an intercept plus a sum (k = 1 to p) of individual functions fk(xk) of the predictor variables: y = β0 + Σk fk(xk) + e.

Basic assumptions: As in 7.1, but add E[fk(xk)]=0 to prevent identifiability problems.
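
A tiny numerical illustration of why the E[fk(xk)] = 0 constraint is needed (simulated data; the component functions are arbitrary choices): without centering, a constant can be shifted freely between the intercept and any fk, so the decomposition is not identifiable.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    x1 = rng.uniform(-2, 2, size=n)
    x2 = rng.uniform(-2, 2, size=n)

    beta0, f1, f2 = 3.0, np.sin(x1), x2 ** 2          # one additive decomposition
    beta0_alt, f1_alt = beta0 - 1.0, f1 + 1.0         # shift a constant into f1

    # Both decompositions give exactly the same regression function ...
    print(np.allclose(beta0 + f1 + f2, beta0_alt + f1_alt + f2))   # True

    # ... so the ambiguity is fixed by centering each component to mean zero.
    f1_c, f2_c = f1 - f1.mean(), f2 - f2.mean()
    beta0_c = beta0 + f1.mean() + f2.mean()           # intercept absorbs the means
    print(np.allclose(beta0 + f1 + f2, beta0_c + f1_c + f2_c))     # True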

Notes:

  1. One can require that some of the fk be linear or monotone.
  2. One can include some low-dimensional smooths (f(x1,x2)).
  3. One can include some kinds of interactions (f(x1x2)).
  4. Transformation of variables is done automatically.
  5. Many regression diagnostics, such as Cook’s distance, generalize to additive models.
  6. Ideas from weighted regression generalize to handle heteroscedasticity.
  7. Approximate deviance tests for comparing nested additive models are available.
  8. One can use the bootstrap to set pointwise confidence bands on the fk.

The Backfitting Algorithm

Used to fit additive models; it allows one to use an arbitrary smoother (spline, Loess, kernel) to estimate the fk. It solves the p estimating equations iteratively; at each stage, the conditional expectation of the partial residuals Y − β0 − Σ_{k≠j} fk(xk) given xj is replaced with a univariate smooth.

Backfitting algorithm:

  1. Initialize. Set β0 = Ȳ and set the fk functions to something reasonable (e.g., a linear regression fit). Set the fk vectors of fitted values to match.
  2. Cycle. For j = 1, …, p set fj = S[ Y − β0 − Σ_{k≠j} fk(xk) | xj ] and update the fitted vector for fj to match.
  3. Iterate. Repeat step 2 until the changes in the fj between iterations are sufficiently small (a runnable sketch of this loop follows the list).
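
A compact sketch of this loop (not a reference implementation: the Nadaraya-Watson kernel smoother, its bandwidth, and the simulated additive data are all arbitrary illustrative choices):

    import numpy as np

    def kernel_smooth(x, r, bandwidth=0.3):
        """Nadaraya-Watson smooth of the partial residuals r against the predictor x."""
        w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
        return (w @ r) / w.sum(axis=1)

    def backfit(X, y, n_iter=50, tol=1e-6):
        n, p = X.shape
        beta0 = y.mean()                      # Step 1: initialize the intercept
        f = np.zeros((p, n))                  # component fits (a linear fit would also do)
        for _ in range(n_iter):
            max_change = 0.0
            for j in range(p):                # Step 2: cycle over the predictors
                partial = y - beta0 - f.sum(axis=0) + f[j]   # partial residuals for x_j
                new_fj = kernel_smooth(X[:, j], partial)
                new_fj -= new_fj.mean()       # re-center to keep E[f_j] = 0
                max_change = max(max_change, np.abs(new_fj - f[j]).max())
                f[j] = new_fj
            if max_change < tol:              # Step 3: stop when the updates are small
                break
        return beta0, f

    # Simulated additive data: y = 2 + sin(x1) + (x2^2, centered) + noise
    rng = np.random.default_rng(3)
    n = 300
    X = rng.uniform(-2, 2, size=(n, 2))
    y = 2.0 + np.sin(X[:, 0]) + (X[:, 1] ** 2 - (X[:, 1] ** 2).mean()) + rng.normal(scale=0.3, size=n)

    beta0_hat, f_hat = backfit(X, y)
    print("intercept estimate:", round(beta0_hat, 2))        # roughly 2 for this simulation
    print("corr(f1_hat, centered sin):",
          round(np.corrcoef(f_hat[0], np.sin(X[:, 0]) - np.sin(X[:, 0]).mean())[0, 1], 3))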

The iterative solution for this has the structure of a Gauss-Seidel algorithm for linear systems (Hastie/Tibshirani).

Convergence: guaranteed for smoothers (spline, kernel, but not Loess) whose smoothing matrix is symmetric with all eigenvalues in (0, 1); the solution is unique unless there is concurvity, in which case the solution reached depends on the initial conditions. The concurvity space of the estimating equations Pg = Qh (written in basis form) is the set of additive functions g such that Pg = 0.
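
For reference, a sketch of the stacked system behind the Pg = Qh notation, following the normal-equations form in Hastie/Tibshirani (here Sj denotes the smoother matrix for xj; identifying the notes' h with the centered response vector y is an assumption of this sketch):

    [ I    S1   ...  S1 ] [ f1 ]   [ S1 y ]
    [ S2   I    ...  S2 ] [ f2 ] = [ S2 y ]
    [ :          .    : ] [  : ]   [   :  ]
    [ Sp   Sp   ...  I  ] [ fp ]   [ Sp y ]

P is the block matrix on the left, g stacks the fitted fj, and Q stacks the Sj. Any additive g with Pg = 0 can be added to a solution without disturbing the equations, which is why the solution is not unique under concurvity.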

Generalized Linear Models

Assumes that there is a link function g such that g(E[Y|x]) = xᵀβ (McCullagh/Nelder).

The generalized additive model expresses g(E[Y|x]) as an additive, rather than linear, function of x. GLMs are fit by iterative scoring, a form of iteratively reweighted least squares; the GAM modifies backfitting in the same spirit, giving the local scoring algorithm (Hastie/Tibshirani 1990).
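
For concreteness, a minimal sketch of iterative scoring / IRLS for one particular GLM, logistic regression with the logit link, on simulated data (this is the weighted-least-squares core that the GAM's backfitting modification builds on, not the GAM algorithm itself; the data and coefficients are made up):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
    beta_true = np.array([-0.5, 1.0, -2.0])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))        # Bernoulli responses

    beta = np.zeros(3)
    for _ in range(25):                        # iteratively reweighted least squares
        eta = X @ beta                         # linear predictor, eta = g(mu)
        mu = 1 / (1 + np.exp(-eta))            # inverse logit link
        w = mu * (1 - mu)                      # GLM working weights
        z = eta + (y - mu) / w                 # working (adjusted) response
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    print("IRLS estimates:", np.round(beta, 2))  # close to beta_true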