Statistical Issues in Machine Learning ~ Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU

Recursive partitioning methods from machine learning are widely
applied in many scientific fields, such as genetics and
bioinformatics. The present work is concerned with the two main
problems that arise in recursive partitioning, instability and
biased variable selection, from a statistical point of view. With
respect to the first issue, instability, the full range of
methods is covered, from standard classification trees through
robustified classification trees to ensemble methods such as TWIX,
bagging and random forests. While ensemble methods
prove to be much more stable than single trees, they also lose
most of their interpretability. Therefore an adaptive cutpoint
selection scheme is suggested with which a TWIX ensemble reduces to
a single tree if the partition is sufficiently stable. With respect
to the second issue, variable selection bias, the statistical
sources of this artifact in single trees and a new form of bias
inherent in ensemble methods based on bootstrap samples are
investigated. For single trees, one unbiased split selection
criterion is evaluated and another is newly introduced. Based
on the results for single trees and further findings on the effects
of bootstrap sampling on association measures, it is shown that, in
addition to using an unbiased split selection criterion,
subsampling instead of bootstrap sampling should be employed in
ensemble methods to be able to reliably compare the variable
importance scores of predictor variables of different types. The
statistical properties and the null hypothesis of a test for the
random forest variable importance are critically investigated.
Finally, a new, conditional importance measure is suggested that
allows for a fair comparison in the case of correlated predictor
variables and better reflects the null hypothesis of interest.
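
The difference between the two resampling schemes mentioned above can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the sample size and the subsample fraction of 0.632 are assumptions chosen purely for illustration. Bootstrap sampling draws n observations with replacement, so some observations appear several times, whereas subsampling draws a smaller set without replacement, so every selected observation appears exactly once.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement (duplicates are possible)."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, fraction, rng):
    """Draw round(fraction * n) distinct indices without replacement."""
    size = round(fraction * n)
    return rng.sample(range(n), size)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 1000                # hypothetical number of observations

boot = bootstrap_indices(n, rng)
sub = subsample_indices(n, 0.632, rng)  # 0.632 is an illustrative choice

n_unique_boot = len(set(boot))
n_unique_sub = len(set(sub))

print("bootstrap: drawn", len(boot), "unique", n_unique_boot)
print("subsample: drawn", len(sub), "unique", n_unique_sub)
```

In an ensemble, each tree would be grown on the observations selected by one such draw; only the bootstrap draw contains repeated observations.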


