Understanding Additive Models and the Curse of Dimensionality
Curse of Dimensionality
7.0 Additive Models; 7.1 Curse of Dimensionality (Richard Bellman): the problem of estimating f becomes vastly harder as p, the dimension of x, increases.
Nonparametric regression model: y=f(x)+e
Minor assumptions:
- The xi are measured without error.
- The ei are independently and identically distributed (i.i.d.) with mean 0.
- The ei have a common unknown variance σ².
Common assumptions:
- f is in a Sobolev space (functions with bounded derivatives).
- f has a bounded number of discontinuities.
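A minimal sketch of this setup, assuming a sinusoidal f and a Nadaraya-Watson kernel smoother as the estimator (both choices are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)         # assumed true f, purely illustrative

# Simulate y_i = f(x_i) + e_i: the x_i are measured without error, the e_i are
# i.i.d. with mean 0 and constant (unknown) variance.
n = 200
x = rng.uniform(0.0, 1.0, n)
y = f(x) + rng.normal(0.0, 0.3, n)

def kernel_smooth(x0, x, y, h=0.05):
    """Nadaraya-Watson estimate of f(x0) using a Gaussian kernel with bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(0.1, 0.9, 5)
fhat = np.array([kernel_smooth(g, x, y) for g in grid])
print(np.round(fhat - f(grid), 2))       # pointwise estimation error at a few x values
```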
Methods that attempt to mitigate the COD: MARS, CART, PPR, loess, random forests, and support vector machines.
COD applies to: all multivariate analyses that do not impose strong modeling assumptions; it bears with equal force on classification, cluster analysis, and multidimensional scaling.
COD descriptions:
- For fixed n, as p increases, the data become sparse.
- As p increases, the number of possible models explodes (2^(1 + 2p + C(p,2)) − 1; this grows superexponentially in p, and no realistic sample size lets the data discriminate among so many models; see the numeric sketch after this list).
- For large p, most datasets are multicollinear or concurve.
  - Multicollinearity occurs when the explanatory values concentrate on an affine subspace in R^p; it implies that the predictive value of the fitted model breaks down quickly as one moves away from the subspace in which the data concentrate.
  - Concurvity, the nonparametric generalization of multicollinearity, occurs when the data concentrate on some smooth manifold within R^p (smooth in the sense that the smoother used in backfitting can interpolate all the xi perfectly).
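Both points above can be made concrete with a short computation; reading the exponent as counting an intercept, p linear terms, p quadratic terms, and C(p,2) pairwise interactions, and the 10%-coverage sparsity calculation, are illustrative assumptions:

```python
from math import comb

# Model explosion: if each of the 1 + 2p + C(p,2) candidate terms (intercept,
# p linear, p quadratic, C(p,2) interactions) may be in or out of the model,
# there are 2^(1 + 2p + C(p,2)) - 1 nonempty candidate models.
for p in (2, 5, 10, 20):
    n_terms = 1 + 2 * p + comb(p, 2)
    print(f"p = {p:2d}: about {float(2 ** n_terms - 1):.2e} candidate models")

# Sparsity: the edge length of a subcube of [0,1]^p needed to capture 10% of
# uniformly distributed data approaches 1 as p grows, so "local" neighborhoods
# stop being local.
for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length {0.10 ** (1 / p):.3f}")
```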
Results claiming to evade the COD:
- Barron 1994: neural networks avoid the COD.
- Zhao/Atkeson 1991: PPR evades the COD.
- Wozniakowski 1991: a modification of Hammersley points dodges the COD in multivariate integration.
Goals
Multiple linear regression model: Y = B0 + B1X1 + … + BpXp + e; E[e] = 0; var(e) = σ²; the ei are i.i.d. and independent of x1, …, xp. (A minimal fitting sketch follows the list below.)
Useful because:
- Interpretable (each xk enters through a single coefficient Bk).
- Well-developed theory supports inference and prediction.
- Simple interactions and transformations are easy.
- Dummy variables allow use of categorical information.
- Computation is fast.
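A minimal numpy sketch of the linear model with made-up data (the coefficients, predictors, and two-level dummy are illustrative only), showing one coefficient per predictor and a dummy-coded category:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
group = rng.integers(0, 2, size=n)            # categorical predictor via a dummy variable

# Y = B0 + B1*x1 + B2*x2 + B3*group + e, with E[e] = 0 and constant variance
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 3.0 * group + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2, group])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(betas, 2))                     # each predictor enters through a single coefficient
```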
Additive Models
Expresses the response variable Y as an intercept plus a sum (k = 1 to p) of individual functions fk of the predictor variables: Y = B0 + Σk fk(xk) + e.
Basic assumptions: As in 7.1, but add E[fk(xk)]=0 to prevent identifiability problems.
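A short sketch of simulating from such an additive model with each fk centered empirically so that E[fk(xk)] = 0; the component functions and intercept are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)

def center(v):
    """Subtract the sample mean so the component averages to zero (identifiability)."""
    return v - v.mean()

# Y = B0 + f1(x1) + f2(x2) + e with centered components, as in the assumption above.
f1 = center(np.sin(np.pi * x1))
f2 = center(x2 ** 2)
y = 2.0 + f1 + f2 + rng.normal(0.0, 0.2, n)

print(round(f1.mean(), 6), round(f2.mean(), 6))   # both 0 by construction
print(round(y.mean(), 2))                          # so B0 is identified as E[Y] (about 2.0)
```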
Notes:
- One can require that some of the fk be linear or monotone.
- One can include some low-dimensional smooths (f(x1,x2)).
- One can include some kinds of interactions (f(x1x2)).
- Transformation of variables is done automatically.
- Many regression diagnostics, such as Cook’s distance, generalize to additive models.
- Ideas from weighted regression generalize to handle heteroscedasticity.
- Approximate deviance tests for comparing nested additive models are available.
- One can use the bootstrap to set pointwise confidence bands on the fk.
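As an illustration of the last note, a minimal bootstrap sketch of pointwise bands around a single univariate smooth (a stand-in for one fk; the smoother, bandwidth, and number of resamples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

def kernel_smooth(grid, x, y, h=0.08):
    """Gaussian-kernel smooth of y on x, evaluated at each point of grid."""
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

grid = np.linspace(0.05, 0.95, 50)
B = 200
fits = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, n, n)               # resample (x_i, y_i) pairs with replacement
    fits[b] = kernel_smooth(grid, x[idx], y[idx])

lower, upper = np.percentile(fits, [2.5, 97.5], axis=0)   # pointwise 95% band
print(np.round(lower[:3], 2), np.round(upper[:3], 2))
```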
The Backfitting Algorithm
Used to fit additive models; allows one to use an arbitrary smoother (spline, loess, kernel) to estimate the fk. It solves the p estimating equations fj(xj) = E[Y - B0 - Σk≠j fk(xk) | xj] iteratively; at each stage, the conditional expectation of the partial residuals Y - B0 - Σk≠j fk(xk) is estimated by a univariate smooth (a minimal implementation sketch appears at the end of this subsection).
Backfitting algorithm:
- Initialize. Set B0 = Ybar and set the fk functions to something reasonable (e.g., a linear regression fit). Set the fk vectors (the fk evaluated at the observed xik) to match.
- Cycle. For j = 1, …, p set fj = S[Y - B0 - Σk≠j fk(xk) | xj] and update the fj vector to match.
- Iterate. Repeat step 2 until the changes in the fk between iterations are sufficiently small.
The iterative solution for this has the structure of a Gauss-Seidel algorithm for linear systems (Hastie/Tibshirani).
Convergence: guaranteed for smoothers (spline, kernel, but not loess) that correspond to a symmetric smoothing matrix with all eigenvalues in (0,1). The solution is unique unless there is concurvity, in which case it depends on the initial conditions. Writing the estimating equations in matrix form as Pg = Qh, the concurvity space is the set of additive functions g such that Pg = 0.
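A compact backfitting sketch under the assumptions above, using a Gaussian-kernel smoother as the arbitrary smoother S and fitting simulated additive data (the smoother, bandwidth, iteration count, and test functions are illustrative choices):

```python
import numpy as np

def kernel_smoother(x, h=0.3):
    """Return an operator r -> smooth of r against x (Gaussian kernel, bandwidth h)."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return lambda r: w @ r

def backfit(X, y, n_iter=20):
    n, p = X.shape
    b0 = y.mean()                              # Initialize: B0 = Ybar
    f = np.zeros((p, n))                       # f_k evaluated at the observed x_ik
    smoothers = [kernel_smoother(X[:, j]) for j in range(p)]
    for _ in range(n_iter):                    # Iterate (fixed count stands in for a tolerance)
        for j in range(p):                     # Cycle over j = 1, ..., p
            partial = y - b0 - f.sum(axis=0) + f[j]   # partial residuals Y - B0 - sum_{k != j} f_k
            f[j] = smoothers[j](partial)       # f_j <- S[partial residuals | x_j]
            f[j] -= f[j].mean()                # re-center so E[f_j] = 0
    return b0, f

rng = np.random.default_rng(3)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
g2 = X[:, 1] ** 2
y = 1.0 + np.sin(X[:, 0]) + (g2 - g2.mean()) + rng.normal(0.0, 0.2, 300)

b0, f = backfit(X, y)
print(round(b0, 2))                            # near the sample mean of y
```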
Generalized Linear Models
Assumes that there is a link function g such that g(E[Y|x]) = x^T B (McCullagh/Nelder).
A generalized additive model expresses the linked mean g(E[Y|x]) as an additive, rather than linear, function of x. GLMs are fit by iterative scoring, a form of iteratively reweighted least squares; the GAM modifies backfitting in a similar way, giving the local scoring algorithm (Hastie/Tibshirani 1990).
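A minimal sketch of iterative scoring (iteratively reweighted least squares) for a logistic-link GLM; the simulated data, link choice, and iteration count are illustrative assumptions. Local scoring for a GAM replaces the weighted least-squares step with a weighted backfit.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

# Iterative scoring for the logistic link g(mu) = log(mu / (1 - mu)), so g(E[Y|x]) = x^T B.
beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = mu * (1.0 - mu)                        # working weights
    z = eta + (y - mu) / w                     # adjusted (working) response
    beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))

print(np.round(beta, 2))                       # should be close to true_beta
```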