CYB5D1 was already mentioned as surpassing all evaluated gene signatures

In total, we analyzed 1324 breast cancer samples from public data sets generated with three microarray technologies. To the best of our knowledge, this is the largest study evaluating the potential benefits of data merging in a quantitative framework for predicting patient risk with respect to OS and RFS. While these findings reconcile the results of the data integration verification with those of the gene signature evaluation, they also reveal the limited usefulness of the data intermingling test, which in this case gives a misleading picture of the variance retained after data integration. Noting that the gene signatures built from subsets of GSE4335 or Vijver showed higher prediction accuracies in cross-validation than the gene signature built from the merged data set, we investigated whether performance could be improved by selective data integration. To assess the reproducibility of the performance of the gene signatures derived from the merged data sets, prediction accuracy was evaluated in a leave-one-data-set-out manner: in each step, one complete source data set was set aside as the testing set while the predictor was built from the remaining sets after merging. In parallel, we carried out pair-wise tests, using one source data set for training and another for testing. Table 2, Table 3 and Tables S1 to S6 summarize the results of these evaluations with respect to the two clinical endpoints, OS and RFS. Merging data sets neither significantly increased nor significantly decreased the survival prediction accuracy or the prognosis of clinical risk. This is explained at least in part by the fact that important risk-associated genes were not present in all data sets. Consequently, the heterogeneity of data sets generated in different laboratories and with different microarray technologies was not the only, and perhaps not even the major, factor limiting the improvement of prediction accuracy by increasing sample size.
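The two evaluation schemes described above (leave-one-data-set-out and pair-wise training/testing) can be sketched as follows. This is only a minimal illustration of how the train/test splits might be enumerated, not the pipeline used in the study. The function names and the third data set name, `dataset_C`, are hypothetical placeholders. Only GSE4335 and Vijver are named in the text.

```python
from itertools import permutations


def leave_one_dataset_out(dataset_names):
    """Yield (training_names, test_name); each data set is held out exactly once."""
    for held_out in dataset_names:
        # The predictor would be built from the merge of all remaining sets.
        training = [name for name in dataset_names if name != held_out]
        yield training, held_out


def pairwise_splits(dataset_names):
    """Yield (train_name, test_name) for every ordered pair of distinct data sets."""
    yield from permutations(dataset_names, 2)


# "dataset_C" is a hypothetical placeholder, not a data set from the study.
datasets = ["GSE4335", "Vijver", "dataset_C"]
lodo = list(leave_one_dataset_out(datasets))
pairs = list(pairwise_splits(datasets))
```

With n source data sets, the first scheme produces n splits and the second n(n-1) ordered pairs, since training on A and testing on B is a different experiment from the reverse.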
Substantial variation in time to death or relapse among breast cancer patients and the heterogeneity of the disease are other constraining factors that nevertheless need to be considered. Moreover, the heterogeneity of patient cohorts in terms of age, lymph node status, tumor grade, tumor size and ER status might negatively affect the accuracy of survival prediction after merging. It is known, for example, that ER+ patients have a good prognosis and ER- patients a poor prognosis in the first five years after diagnosis or surgery. Despite the caveats mentioned above, the results show that selectively merging those data sets that give rise to accurate predictors when used alone can improve performance. Moreover, our results confirm that predictors based on large merged data sets are more robust, i.e. their worst performance observed over multiple iterations of cross-validation tends to be substantially better than the worst performance of gene signatures based on single data sets. This may be viewed as an advantage in itself. In general, the prediction accuracy of the gene signatures derived from the merged data sets remained consistent and reproducible across independent studies. Prediction accuracies measured in cross-validation carried over to independent testing sets. The systematic evaluation of predictors built from the single and merged gene expression data sets also led us to the surprising observation that a single-gene signature consisting of CYB5D1 had the highest prediction accuracy and the strongest association with patient risk in breast cancer.
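The notion of robustness used above, comparing the worst performance observed over repeated cross-validation iterations, can be made concrete with a short sketch. The helper names are hypothetical, and the accuracy values are arbitrary placeholders chosen for illustration. They are not results from this study.

```python
def worst_case(accuracies):
    """Return the lowest accuracy observed over repeated cross-validation runs."""
    return min(accuracies)


def is_more_robust(candidate_accuracies, reference_accuracies):
    """True if the candidate's worst run beats the reference's worst run."""
    return worst_case(candidate_accuracies) > worst_case(reference_accuracies)


# Arbitrary placeholder values, NOT measurements from the study.
merged_runs = [0.62, 0.65, 0.63]
single_runs = [0.70, 0.48, 0.66]

merged_is_more_robust = is_more_robust(merged_runs, single_runs)
```

In this illustration the single-data-set predictor reaches the higher peak accuracy, but the merged predictor has the better worst case. That is the sense in which a predictor can be more robust without being more accurate on average.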