Reliable gene signatures for microarray classification: assessment of stability and performance

Gene expression data enables sample classification based on subsets of genes (gene signatures). However, approaches that simply train classifiers on the basis of gene signatures have shown significant instability, as discussed in Michiels et al. (2005). Expanding upon this work, we consider all possible combinations of gene selection methods with classification methods. These combinations are then evaluated via random sampling. Signatures that prove to be stable over these samplings provide a more consistent insight into the underlying biological processes that distinguish sample groups.


Chad A. Davis^{1, 2, 3}
Fabian Gerick^{1, 2, 3}
Volker Hintermair^{1, 2, 3}
Caroline C. Friedel¹
Katrin Fundel¹
Robert Küffner¹
Ralf Zimmer¹

¹ Institut für Informatik, Ludwig-Maximilians-Universität, Amalienstraße 17, 80333 Munich, Germany
² Fakultät für Informatik, Technische Universität München, Boltzmannstraße 3, 85748 Garching, Germany
³ These authors contributed equally to this work

Methods

A group of gene expression arrays is randomly split into a training set and a validation set several times (random sampling). Different feature subset selection (FSS) methods, which here select genes, are used to produce gene signatures for these training sets based on the pre-defined disease stages of the arrays. Gene signatures are produced by Pearson correlation, F-test, and combined p-value & fold change thresholds. Additionally, support vector machines (SVMs) and decision trees are used as wrapper methods for FSS.
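The repeated random split described above can be sketched as follows. This is a minimal illustration, not the StabPerf implementation: the function name `random_splits`, the number of splits (100), and the training fraction (0.8) are assumptions, since the text does not state them.

```python
import random

def random_splits(sample_ids, n_splits, train_fraction, seed=0):
    """Return n_splits (training, validation) pairs drawn by random shuffling.

    n_splits and train_fraction are illustrative assumptions; the source
    does not specify how many splits or which proportions were used.
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    n_train = round(len(ids) * train_fraction)
    splits = []
    for _ in range(n_splits):
        shuffled = ids[:]          # copy so every split starts from the full set
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

# 78 arrays, as in the osteoarthritis dataset; 100 splits is an assumed value.
splits = random_splits(range(78), n_splits=100, train_fraction=0.8)
```

Each sampling step would then run an FSS method on the training part only, so that the validation arrays never influence the gene signature.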

Resulting signatures are optionally filtered by GO annotation (Ashburner et al., 2000), over-representation analysis (Draghici et al., 2003), or sampling (Michiels et al., 2005), which keeps the genes occurring in 50% of the signatures from a subsequent re-sampling step.
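As a rough sketch of the sampling filter, the snippet below keeps the genes that occur in at least half of a set of re-sampled signatures. Reading "50%" as "at least 50%" is an assumption, and the function name `frequent_genes` and the gene labels are hypothetical.

```python
from collections import Counter

def frequent_genes(signatures, min_fraction=0.5):
    """Keep genes present in at least min_fraction of the signatures.

    Treating the 50% rule as an inclusive lower bound is an assumption.
    """
    counts = Counter(gene for sig in signatures for gene in set(sig))
    threshold = min_fraction * len(signatures)
    return {gene for gene, count in counts.items() if count >= threshold}

# Hypothetical signatures from four re-sampling steps.
resampled = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"B"}]
kept = frequent_genes(resampled)
```

In this toy input, genes A and B each occur in three of four signatures and are kept, while C and D occur once and are dropped.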

A model is derived by pairing the aforementioned FSS methods with a number of multi-class classification methods: kNN (k-nearest neighbor), SVMs (polynomial or RBF kernels) and decision trees. For each of these combinations a model is produced in each sampling step, composed of the respective gene signature and the classifier. We select the best combination (model selection) by evaluating the models according to a number of factors: 1) stability of the gene signatures, 2) length of the gene signatures, 3) accuracy of the classifier on the validation set over all the sampling steps, 4) median absolute deviation (MAD) of the classification accuracies over all sampling steps. The best combination is used to create a final model on the full data set.
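Factors 3 and 4 can be illustrated with a small summary of per-split accuracies. This is a sketch under assumptions: the text does not say how accuracies are aggregated, so the median is used here, and the function names and accuracy values are invented for illustration. How the four factors are weighted against each other is not stated in the text and is not modeled.

```python
from statistics import median

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

def summarize_accuracies(accuracies):
    """Summarize validation accuracies over all sampling steps.

    Using the median as the central accuracy is an assumption.
    """
    return {
        "median": median(accuracies),
        "mad": median_absolute_deviation(accuracies),
    }

# Hypothetical validation accuracies from five sampling steps.
summary = summarize_accuracies([0.70, 0.75, 0.65, 0.80, 0.73])
```

A combination with high accuracy but a large MAD would be flagged as unreliable, which is the pattern reported for the healthy versus early comparison.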

Overly optimistic estimates are avoided by incorporating the variance of the individual classification accuracies. Additionally, requiring stable gene signatures enforces a biological control on "black box" classifiers and prevents them from extracting information from random variance within the data. The stability of an FSS method is estimated by the fraction of gene signatures in which a gene is selected on average. The system is verified by a balanced ten-fold cross validation.
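One possible reading of the stability estimate is sketched below: for every gene that appears in at least one signature, compute the fraction of signatures containing it, then average these fractions over the genes. This interpretation, the function name `signature_stability`, and the gene labels are assumptions; the package may define stability differently.

```python
from collections import Counter

def signature_stability(signatures):
    """Average, over all selected genes, of the fraction of signatures
    that contain the gene (an assumed reading of the stability measure)."""
    n_signatures = len(signatures)
    counts = Counter(gene for sig in signatures for gene in set(sig))
    if not counts:
        return 0.0
    fractions = [count / n_signatures for count in counts.values()]
    return sum(fractions) / len(fractions)

# Hypothetical signatures from four sampling steps.
stability = signature_stability([{"A", "B"}, {"A", "C"}, {"A", "B"}, {"A"}])
```

Under this reading, identical signatures in every sampling step give a stability of 1, and completely disjoint signatures over n steps give 1/n.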

Results and Conclusion

Our iterated random sampling model selection procedure, based on combinations of FSS and classification methods, was applied to a gene expression dataset from 78 osteoarthritis (OA) patients (Fundel et al., 2005), with each single-channel cDNA array containing 7467 spots. According to the stages of disease progression, the arrays are classified as either healthy (18 patients), early (20), peripheral (21) or central (19), whereby peripheral and central describe two variants of late OA.

The best model for this dataset among all four patient classes was created by using combined p-value & fold change thresholds (p-value < 0.05, |fold-change| > 1.5) as the FSS method. Over-representation analysis (p-value < 0.001, considering GO terms with more than 4 and fewer than 100 spots) was applied as a post filter on the gene signatures. The chosen classification method was kNN (). We achieve a classification accuracy on the validation set of 73.0% at a MAD of 9.4%. The gene signatures show a stability of 51%, containing 146 genes on average.
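The combined threshold filter can be sketched as below, using the cut-offs quoted above. It assumes that p-values and signed fold changes have already been computed per gene; the function name `select_signature` and the gene statistics are hypothetical, and how the package computes either quantity is not covered.

```python
def select_signature(gene_stats, max_p=0.05, min_abs_fold=1.5):
    """Select genes with p-value < max_p and |fold change| > min_abs_fold.

    gene_stats maps a gene name to a (p_value, signed_fold_change) pair;
    the signed fold-change convention is an assumption.
    """
    return [
        gene
        for gene, (p_value, fold_change) in gene_stats.items()
        if p_value < max_p and abs(fold_change) > min_abs_fold
    ]

# Hypothetical per-gene statistics computed on one training set.
stats = {
    "g1": (0.01, 2.0),    # passes both thresholds
    "g2": (0.20, 3.0),    # fails the p-value threshold
    "g3": (0.03, -1.8),   # passes: down-regulated, |fold change| > 1.5
    "g4": (0.04, 1.2),    # fails the fold-change threshold
}
signature = select_signature(stats)
```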

Our approach provides the model best suited to the given data by comparing FSS methods in combination with classification methods. In a ten-fold cross validation, the best models across the ten folds are not always derived from the same FSS and classification methods. However, they show consistent performance, ranging from 68.1% to 78.2% accuracy on the respective validation datasets (average within-fold deviation: 13.1%).
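A balanced ten-fold split can be sketched as below, taking "balanced" to mean that each fold holds roughly the same share of every patient class. That interpretation and the function name `balanced_folds` are assumptions; the class sizes are those reported for the OA dataset.

```python
import random
from collections import defaultdict

def balanced_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so class counts per fold differ by at most one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the class members round-robin, continuing where the last class stopped.
        for index in members:
            folds[position % n_folds].append(index)
            position += 1
    return folds

# Class sizes from the dataset: healthy 18, early 20, peripheral 21, central 19.
labels = ["healthy"] * 18 + ["early"] * 20 + ["peripheral"] * 21 + ["central"] * 19
folds = balanced_folds(labels)
```

In each round, one fold would serve as the validation set while the full model selection procedure runs on the remaining nine.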

The discrimination between healthy and early degenerative cartilage patients proved to be most difficult. This was not evident at first glance, given a classification accuracy of 81.4% on the validation data set, but our sampling procedure identified a deviation of 24.7%, showing that the raw accuracy score even on the validation data is already overly optimistic. As no stable signatures could be found in this case, most of the classification performance stems from overfitting to random variance. In comparison, between healthy and central, an accuracy of 98.1% with a deviation of 0.0% could be achieved simply by using decision trees with the Pearson correlation as the FSS method.

In conclusion, our method provides a robust estimate of the quality of the model chosen and the separability of sample groups, through repeated random sampling. Additionally, focusing on gene signatures serves as a biological restraint on classifiers, reducing the amount of random variance learned.

Documentation

The documentation for package 'StabPerf' version 0.5 can be found here.

Bibliography

2000

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1), pp. 25-9, 2000.

2003

Draghici, S., Khatri, P., Martins, R. P., Ostermeier, G. C., and Krawetz, S. A. Global functional profiling of gene expression. Genomics, 81(2), pp. 98-104, 2003.

2005


Fundel, K., Küffner, R., Aigner, T., and Zimmer, R. Data Processing Effects on the Interpretation of Microarray Gene Expression Experiments. In Torda, A., Kurtz, S., and Rarey, M. (eds.): German Conference on Bioinformatics (GCB) 2005, GI Lecture Notes in Informatics, vol. P-71, pp. 77-91, GI, 2005.


Michiels, S., Koscielny, S., and Hill, C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365(9458), pp. 488-92, 2005.

Download

The StabPerf package contains the core functionality of the gene expression analysis pipeline:

StabPerf_0.5.4.tar.gz

StabPerf_0.5.2.tar.gz

StabPerf_0.5.1.tar.gz

StabPerf_0.5.tar.gz

The system uses a modified version of the e1071 library to implement a custom feature selection wrapper method based on support vector machines:

e1071_1.5-8-1.tar.gz

Internally, documentation is generated using a custom version of the mvbutils library. This is not required, unless you plan to redistribute this package.

mvbutils_1.1.3.tar.gz
