A Framework for Unbiased Model Selection Based on Boosting
Description
14 years ago
Variable selection and model choice are of major concern in many
statistical applications, especially in high-dimensional regression
models. Boosting is a convenient statistical method that combines
model fitting with intrinsic model selection. We investigate the
impact of base-learner specification on the performance of boosting
as a model selection procedure. We show that variable selection may
be biased if the covariates differ in nature. Important
examples are models combining continuous and categorical
covariates, especially if the number of categories is large. In
this case, least squares base-learners offer increased flexibility
for the categorical covariate and lead to its preferential
selection even if it is non-informative. Similar difficulties
arise when comparing linear and nonlinear base-learners for a
continuous covariate. The additional flexibility in the nonlinear
base-learner again yields a preference for the more complex modeling
alternative. We investigate these problems from a theoretical
perspective and suggest a framework for unbiased model selection
based on a general class of penalized least squares base-learners.
Making all base-learners comparable in terms of their degrees of
freedom strongly reduces the selection bias observed in naive
boosting specifications. The importance of unbiased model selection
is demonstrated in simulations and an application to forest health
models.
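
As a rough illustration of the idea of making base-learners comparable, the following sketch (not taken from the paper; a minimal numpy/scipy example with hypothetical helper names such as smoother_df and lambda_for_df) calibrates the ridge penalty of a dummy-coded categorical base-learner so that its degrees of freedom, measured here as the trace of the smoother matrix S = X(X'X + lambda*K)^{-1}X', match those of an unpenalized linear base-learner; the paper's exact degrees-of-freedom definition may differ.

import numpy as np
from scipy.optimize import brentq

def smoother_df(X, K, lam):
    # Degrees of freedom of a penalized least squares base-learner,
    # measured here as the trace of the smoother matrix
    # S = X (X'X + lam*K)^{-1} X'.
    core = np.linalg.solve(X.T @ X + lam * K, X.T)
    return float(np.trace(X @ core))

def lambda_for_df(X, K, target_df, lam_max=1e8):
    # Hypothetical helper: find the ridge penalty lam that gives the
    # base-learner the requested degrees of freedom (root search on lam).
    return brentq(lambda lam: smoother_df(X, K, lam) - target_df, 1e-12, lam_max)

rng = np.random.default_rng(1)
n = 200

# Categorical covariate with 10 levels, dummy-coded and ridge-penalized.
levels = rng.integers(0, 10, size=n)
X_cat = np.eye(10)[levels]
K_cat = np.eye(10)

# Continuous covariate as an unpenalized linear base-learner (df = 1).
X_lin = rng.normal(size=(n, 1))
K_lin = np.zeros((1, 1))

# Calibrate the categorical base-learner to the same df as the linear one,
# so neither component is preferred merely because it is more flexible.
lam_cat = lambda_for_df(X_cat, K_cat, target_df=1.0)
print("df categorical:", round(smoother_df(X_cat, K_cat, lam_cat), 3))
print("df linear:     ", round(smoother_df(X_lin, K_lin, 0.0), 3))

With both base-learners held at the same degrees of freedom, componentwise boosting should no longer favour the categorical covariate merely because of its larger parameter space, which is the selection bias the abstract describes.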