Our iterated random sampling model selection procedure,
based on combinations of FSS and classification methods, was applied to
a gene expression dataset from 78 osteoarthritis (OA) patients (Fundel et al., 2005),
with each single channel cDNA array containing 7467 spots. According to
the stages of disease progression, the arrays are classified as healthy (18 patients), early (20), peripheral (21), or central (19), where peripheral and central describe two variants of late OA.
The best model for discriminating all four patient classes in this
dataset was created by using combined p-value and fold-change thresholds
(p-value < 0.05, |fold-change| > 1.5) as the FSS method. Over-representation
analysis (p-value < 0.001, considering GO terms with more than 4 and
less than 100 spots) was applied as a post filter on the gene
signatures. The chosen classification method was kNN ().
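The thresholding and post-filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`select_spots`, `enrichment_p`, `enriched_terms`) and the GO term labels in the usage note are hypothetical, p-values and fold changes are assumed to be precomputed per spot, and over-representation is assumed to be scored with a one-sided hypergeometric test. The text does not state whether the fold change is on a linear or logarithmic scale, so the values are taken as given.

```python
from math import comb


def select_spots(p_values, fold_changes, p_max=0.05, fc_min=1.5):
    """Return indices of spots that pass both thresholds (assumed precomputed inputs)."""
    return [
        i
        for i, (p, fc) in enumerate(zip(p_values, fold_changes))
        if p < p_max and abs(fc) > fc_min
    ]


def enrichment_p(n_total, n_term, n_selected, n_hits):
    """Upper-tail hypergeometric probability P(X >= n_hits).

    n_total: spots on the array, n_term: spots annotated with the term,
    n_selected: spots in the signature, n_hits: selected spots with the term.
    """
    upper = min(n_term, n_selected)
    tail = sum(
        comb(n_term, i) * comb(n_total - n_term, n_selected - i)
        for i in range(n_hits, upper + 1)
    )
    return tail / comb(n_total, n_selected)


def enriched_terms(term_to_spots, selected, n_total, p_max=0.001,
                   min_size=4, max_size=100):
    """Keep terms with more than min_size and fewer than max_size spots and p < p_max."""
    selected = set(selected)
    kept = {}
    for term, spots in term_to_spots.items():
        spots = set(spots)
        if not (min_size < len(spots) < max_size):
            continue  # term is too small or too large to be considered
        hits = len(spots & selected)
        p = enrichment_p(n_total, len(spots), len(selected), hits)
        if p < p_max:
            kept[term] = p
    return kept
```

For example, `select_spots([0.01, 0.2, 0.04], [2.0, 3.0, -1.8])` keeps the first and third spot, and `enriched_terms({"GO:A": range(5)}, range(5), 1000)` keeps the hypothetical term `GO:A` because all five of its spots fall into the signature.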
This model achieves a classification accuracy of 73.0% on the validation
set, with a MAD of 9.4%. The gene signatures contain 146 genes on average
and show a stability of 51%.
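How such figures could be obtained from repeated random sampling is sketched below. The sketch rests on several assumptions that the text does not confirm: MAD is read as the median absolute deviation of the per-iteration accuracies, stability is read as the mean pairwise Jaccard overlap of the gene signatures, and `evaluate` is a hypothetical callback that trains a model on the training indices and returns its validation accuracy together with its gene signature.

```python
import random
from itertools import combinations
from statistics import mean, median


def median_abs_deviation(values):
    """Median absolute deviation from the median (assumed reading of 'MAD')."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)


def signature_stability(signatures):
    """Mean pairwise Jaccard overlap of signatures (an assumed stability measure)."""
    overlaps = [
        len(a & b) / len(a | b)
        for a, b in combinations(signatures, 2)
        if a | b  # skip pairs of empty signatures
    ]
    return mean(overlaps) if overlaps else 0.0


def repeated_sampling(n_samples, evaluate, n_iter=100, test_fraction=0.25, seed=0):
    """Repeatedly split the samples at random and summarize the results.

    evaluate(train_idx, test_idx) is a hypothetical callback returning
    (validation accuracy, iterable of selected gene identifiers).
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_test = max(1, round(n_samples * test_fraction))
    accuracies, signatures = [], []
    for _ in range(n_iter):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        accuracy, signature = evaluate(train_idx, test_idx)
        accuracies.append(accuracy)
        signatures.append(set(signature))
    return {
        "accuracy": mean(accuracies),
        "mad": median_abs_deviation(accuracies),
        "stability": signature_stability(signatures),
        "signature_size": mean(len(s) for s in signatures),
    }
```

A large deviation or a low stability in the returned summary indicates that a single accuracy value would be an unreliable description of the model.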
Our approach provides the model best suited to the given data
by comparing FSS methods in combination with classification methods. In
a ten-fold cross validation, the best models across the ten folds are
not always derived from the same FSS and classification methods.
However, they show consistent performance, ranging from 68.1% to 78.2%
accuracy on the respective validation datasets (average within-fold
The discrimination between healthy and early
degenerative cartilage patients proved to be most difficult. This was
not evident at first glance with a classification accuracy of 81.4% on
the validation data set, but our sampling procedure identified a
deviation of 24.7%, showing that the raw accuracy score even on the
validation data is already overly optimistic. As no stable signatures
could be found in this case, most of the classification performance
stems from overfitting to random variance. In comparison, for healthy versus central,
an accuracy of 98.1% with a deviation of 0.0% could be achieved simply
by using decision trees with the Pearson correlation as the FSS method.
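A correlation-based FSS step of this kind might look like the following sketch. It covers only the feature selection, not the decision tree, and it assumes that each spot is ranked by the absolute Pearson correlation between its expression values and a 0/1 class label; the function names `pearson` and `top_correlated_spots` and the cutoff `k` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_correlated_spots(expression, labels, k):
    """Return indices of the k spots most correlated with the class label.

    expression: one list of values per spot, ordered by patient.
    labels: 0/1 class label per patient (e.g. healthy = 0, central = 1).
    """
    scores = [abs(pearson(row, labels)) for row in expression]
    ranked = sorted(range(len(expression)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

With two well-separated classes, a handful of highly correlated spots can already be sufficient for a simple classifier, which is consistent with the high accuracy and zero deviation reported for healthy versus central.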
In conclusion, our method provides a robust estimate of the
quality of the model chosen and the separability of sample groups,
through repeated random sampling. Additionally, focusing on gene
signatures serves as a biological constraint on classifiers, reducing
the amount of random variance learned.