Statistical Issues in Machine Learning
Description
16 years ago
Recursive partitioning methods from machine learning are being
widely applied in many scientific fields such as genetics
and bioinformatics. The present work is concerned with the two main
problems that arise in recursive partitioning, instability and
biased variable selection, from a statistical point of view. With
respect to the first issue, instability, this work covers the full
range of methods, from standard and robustified classification trees
to ensemble methods such as TWIX, bagging and random forests. While
ensemble methods
prove to be much more stable than single trees, they also lose
most of their interpretability. Therefore, an adaptive cutpoint
selection scheme is suggested with which a TWIX ensemble reduces to
a single tree if the partition is sufficiently stable. With respect
to the second issue, variable selection bias, the statistical
sources of this artifact in single trees and a new form of bias
inherent in ensemble methods based on bootstrap samples are
investigated. For single trees, one unbiased split selection
criterion is evaluated and another is newly introduced here. Based
on the results for single trees and further findings on the effects
of bootstrap sampling on association measures, it is shown that, in
addition to using an unbiased split selection criterion,
subsampling instead of bootstrap sampling should be employed in
ensemble methods to be able to reliably compare the variable
importance scores of predictor variables of different types. The
statistical properties and the null hypothesis of a test for the
random forest variable importance are critically investigated.
Finally, a new, conditional importance measure is suggested that
allows for a fair comparison in the case of correlated predictor
variables and better reflects the null hypothesis of interest.
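As a small illustration of the sampling point above (a hedged sketch, not code from the thesis): a bootstrap sample of size n drawn with replacement contains only about 1 - 1/e ≈ 63.2% distinct observations, whereas a subsample drawn without replacement contains no duplicates. It is these duplicated observations that can distort association measures and hence variable importance comparisons.

```python
# Illustrative sketch (not from the thesis): fraction of distinct
# observations in a bootstrap sample vs. a subsample without replacement.
import random

random.seed(1)
n = 10_000

# Bootstrap: draw n indices with replacement -> duplicates are expected.
bootstrap = [random.randrange(n) for _ in range(n)]
unique_frac = len(set(bootstrap)) / n   # approaches 1 - 1/e (about 0.632)

# Subsampling: draw 0.632 * n indices without replacement -> no duplicates.
subsample = random.sample(range(n), int(0.632 * n))

print(round(unique_frac, 3))                   # close to 0.632
print(len(set(subsample)) == len(subsample))   # True: all observations distinct
```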
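To make the notion of permutation variable importance concrete, here is a minimal, self-contained sketch. It is illustrative only; the stump model, the toy data, and all names are hypothetical, not the thesis's implementation. A predictor's importance is measured as the drop in predictive accuracy after randomly permuting that predictor's values, which breaks its association with the response.

```python
# Hypothetical sketch of permutation variable importance.
import random

random.seed(0)

# Toy data: the label depends on x0 only; x1 is pure noise.
data = [(random.random(), random.random()) for _ in range(200)]
labels = [1 if x0 > 0.5 else 0 for x0, _ in data]

def stump_predict(x0, x1):
    """A fixed decision stump that splits on x0 at 0.5 (x1 is ignored)."""
    return 1 if x0 > 0.5 else 0

def accuracy(rows, ys):
    return sum(stump_predict(*r) == y for r, y in zip(rows, ys)) / len(ys)

def permutation_importance(var_index, rows, ys, n_rep=20):
    """Mean drop in accuracy after permuting one predictor column."""
    base = accuracy(rows, ys)
    drops = []
    for _ in range(n_rep):
        col = [r[var_index] for r in rows]
        random.shuffle(col)
        permuted = [tuple(col[i] if j == var_index else v
                          for j, v in enumerate(r))
                    for i, r in enumerate(rows)]
        drops.append(base - accuracy(permuted, ys))
    return sum(drops) / n_rep

imp_x0 = permutation_importance(0, data, labels)  # large: x0 is informative
imp_x1 = permutation_importance(1, data, labels)  # exactly 0: x1 is unused
```

Because the stump never uses x1, permuting it leaves every prediction unchanged, so its importance is zero; permuting the informative x0 roughly halves the accuracy. The thesis's point is that, for correlated predictors, such unconditional permutation reflects a different null hypothesis than one might expect, motivating a conditional permutation scheme.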