Analysis of missing data with random forests ~ Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU

Random Forests are widely used for prediction and interpretation.
They have many appealing characteristics, such as the ability to
deal with high-dimensional data, complex interactions and
correlations. Furthermore, missing values can easily be processed by
the built-in procedure of surrogate splits. However, little is known
about the properties of recursive partitioning in missing data
situations. Therefore, extensive simulation studies and empirical
evaluations have been conducted to gain deeper insight. In addition,
new methods have been developed to enhance methodology and solve
current issues of data interpretation, prediction and variable
selection.

A
variable’s relevance in a Random Forest can be assessed by means of
importance measures. Unfortunately, existing methods cannot be
applied when the data contain missing values. Thus, one of the most
appreciated properties of Random Forests, their ability to handle
missing values, is lost for the computation of such measures. This
work presents a new approach that is designed to deal with missing
values in an intuitive and straightforward way, yet retains widely
appreciated qualities of existing methods. Results indicate that it
meets sensible requirements and shows good variable ranking
properties.

Random Forests provide variable
selection that is usually based on importance measures. An extensive
review of the corresponding literature led to the development of a
new approach that is based on a sound theoretical framework and
meets important statistical properties. A comparison with eight
other popular methods showed that it controls the test-wise and
family-wise error rate, provides higher power to distinguish
relevant from non-relevant variables and leads to models located
among the best performing ones.

Alternative ways to handle missing
values are the application of imputation methods and complete case
analysis. Yet it is unknown to what extent these approaches are able
to provide sensible variable rankings and meaningful variable
selections. Investigations showed that complete case analysis leads
to inaccurate variable selection, as it may inappropriately penalize
the importance of fully observed variables. By contrast, the new
importance measure decreases for variables with missing values and
therefore leads to selections that accurately reflect the
information given in actual data situations. Multiple imputation
leads to an assessment of a variable’s importance and to selection
frequencies that would be expected for completely observed data. In
several performance evaluations the best prediction accuracy emerged
from multiple imputation, closely followed by the application of
surrogate splits. Complete case analysis clearly performed worst.
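
The difference between complete case analysis and imputation can be
illustrated with a minimal sketch. It is not the method developed in
this work: the toy data and the helper names `complete_cases` and
`mean_impute` are assumptions made for illustration, and simple
single mean imputation stands in for the multiple imputation
discussed above. The sketch only shows that complete case analysis
also discards observed values of a fully observed variable, whereas
imputation keeps every row.

```python
from statistics import mean

# Hypothetical toy data: each row is (x1, x2, y); None marks a missing value.
# x1 is fully observed, x2 is missing in half of the rows.
rows = [
    (1.0, 2.0, 0),
    (2.0, None, 0),
    (3.0, 4.0, 1),
    (4.0, None, 1),
    (5.0, 6.0, 1),
    (6.0, None, 0),
]


def complete_cases(data):
    """Keep only rows without any missing value (complete case analysis)."""
    return [row for row in data if all(v is not None for v in row)]


def mean_impute(data):
    """Replace each missing value by the mean of the observed values in its column.

    This is single mean imputation, a simplification of multiple imputation.
    """
    columns = list(zip(*data))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        tuple(m if v is None else v for v, m in zip(row, col_means))
        for row in data
    ]


cc = complete_cases(rows)      # drops every row in which x2 is missing
imputed = mean_impute(rows)    # keeps all rows, fills x2 with its column mean

print(len(rows), len(cc), len(imputed))
```

In this toy example half of the observed `x1` values are lost under
complete case analysis although `x1` itself has no missing values,
which is one plausible way a fully observed variable could end up
being penalized.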


