The behaviour of random forest permutation-based variable importance measures under predictor correlation
Description
14 years ago
Background: Random forests (RF) have been increasingly used in
applications such as genome-wide association and microarray studies
where predictor correlation is frequently observed. Recent works on
permutation-based variable importance measures (VIMs) used in RF
have come to apparently contradictory conclusions. We present an
extended simulation study to synthesize results. Results: When
predictors were both correlated with each other and associated
with the outcome (H(A)), the unconditional RF VIM
attributed a higher share of importance to correlated predictors,
while under the null hypothesis that no predictors are associated
with the outcome (H(0)) the unconditional RF VIM was unbiased.
Conditional VIMs yielded lower values for correlated
predictors than the unconditional VIMs under H(A) and were
unbiased under H(0). Scaled VIMs were clearly biased under both H(A) and
H(0). Conclusions: Unconditional unscaled VIMs are a
computationally tractable choice for large datasets and are
unbiased under the null hypothesis. Whether the increased VIMs
observed for correlated predictors should be considered a "bias"
(because they do not directly reflect the coefficients in the
generating model) or a beneficial attribute of these VIMs
depends on the application. For example, in genetic association
studies, where correlation between markers may help to localize the
functionally relevant variant, the increased importance of
correlated predictors may be an advantage. On the other hand, we
show examples where this increased importance may result in
spurious signals.
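
The mechanics behind the unconditional and scaled VIMs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulation code used in the study: the helper names (`mse`, `permutation_vim`, `raw_and_scaled`), the toy data, and the stand-in `predict` function are all hypothetical, and a real random forest would compute the error increase per tree on that tree's out-of-bag observations. The sketch assumes the common definition in which the raw VIM is the mean per-tree error increase and the scaled VIM divides that mean by its standard error. The conditional VIM, which permutes a predictor only within groups defined by its correlated predictors, is not shown.

```python
import random
import statistics


def mse(predict, rows, y):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_vim(predict, rows, y, j, rng):
    """Unconditional permutation VIM of predictor j for one fitted model:
    error after shuffling column j minus error on the intact data."""
    shuffled = [r[j] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, y) - mse(predict, rows, y)


def raw_and_scaled(per_tree):
    """Raw VIM = mean over trees; scaled VIM = mean / standard error
    (assumed definition, see the text above)."""
    raw = statistics.mean(per_tree)
    se = statistics.stdev(per_tree) / len(per_tree) ** 0.5
    return raw, raw / se


# Hypothetical toy data: the outcome depends on column 0 only.
rng = random.Random(0)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
y = [2 * r[0] for r in rows]


def predict(r):
    """Stand-in for a fitted model (hypothetical); it ignores column 1."""
    return 2 * r[0]


vim_x0 = permutation_vim(predict, rows, y, 0, rng)  # predictor the model uses
vim_x1 = permutation_vim(predict, rows, y, 1, rng)  # predictor the model ignores

# Made-up per-tree error increases, first for 4 trees, then the same
# values repeated to mimic a forest of 16 trees.
per_tree = [0.1, 0.3, 0.2, 0.2]
raw4, scaled4 = raw_and_scaled(per_tree)
raw16, scaled16 = raw_and_scaled(per_tree * 4)

print("VIM x0:", vim_x0, "VIM x1:", vim_x1)
print("raw:", raw4, raw16, "scaled:", scaled4, scaled16)
```

In this toy setup, permuting a predictor the model never uses leaves the error unchanged, so its VIM is zero. The raw VIM is the same for 4 and 16 trees, while the scaled value grows with the number of trees because the standard error shrinks. That is a property of the assumed scaling formula, shown here on made-up numbers.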