Unbiased split selection for classification trees based on the Gini Index
Description
19 years ago
The Gini gain is one of the most common variable selection criteria
in machine learning. We derive the exact distribution of the
maximally selected Gini gain in the context of binary
classification with continuous predictors by means of a
combinatorial approach. This distribution provides formal support
for the variable selection bias in favor of variables with a high
amount of missing values when the Gini gain is used as the split
selection criterion, and we suggest using the resulting p-value as
an unbiased split selection criterion in recursive partitioning
algorithms. We demonstrate the efficiency of our novel method in
simulation and real-data studies from veterinary gynecology in
the context of binary classification and continuous predictor
variables with different numbers of missing values. Our method is
extensible to categorical and ordinal predictor variables and to
other split selection criteria such as the cross-entropy criterion.
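As a rough illustration of the quantities the abstract refers to, the sketch below computes the maximally selected Gini gain of a continuous predictor for binary labels. The paper derives the exact distribution of this statistic combinatorially; here, as a stand-in assumption, the p-value is merely approximated by a permutation test under the null hypothesis that the predictor and the labels are independent. All function names are illustrative, not from the paper.

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity 2*p*(1-p) of a binary (0/1) label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def max_gini_gain(x, y):
    """Maximally selected Gini gain over all cutpoints of a continuous predictor x."""
    order = np.argsort(x)
    y_sorted = np.asarray(y)[order]
    n = len(y_sorted)
    parent = gini_impurity(y_sorted)
    best = 0.0
    for k in range(1, n):  # candidate split between positions k-1 and k
        left, right = y_sorted[:k], y_sorted[k:]
        gain = (parent
                - (k / n) * gini_impurity(left)
                - ((n - k) / n) * gini_impurity(right))
        best = max(best, gain)
    return best

def gini_gain_pvalue(x, y, n_perm=1000, seed=0):
    """Permutation approximation (NOT the paper's exact distribution) of
    P(max Gini gain >= observed) under independence of x and y."""
    rng = np.random.default_rng(seed)
    obs = max_gini_gain(x, y)
    exceed = sum(max_gini_gain(rng.permutation(np.asarray(x)), y) >= obs
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)
```

Selecting the variable with the smallest p-value, rather than the largest raw gain, is what removes the bias: a variable with many missing values is evaluated on fewer observations and hence fewer candidate cutpoints, and the p-value accounts for that differing number of splits while the raw maximal gain does not.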