Combining ranks of gene expression values, which are independent of normalization, obtained with different probes

Systematic differences between published data sets may also result from different pre-processing steps applied by the authors. For instance, expression levels are sometimes expressed as absolute values and sometimes as log ratios with respect to a reference sample. To avoid bias resulting from pre-processing, Reyal et al. restricted their studies to data sets generated with the same chip for which raw data were available, and re-processed all data sets prior to merging. Other studies used homogeneous or heterogeneous data sets, as pre-normalized in the original studies, and applied a so-called data integration method prior to data fusion. A data integration method serves to project expression values for the same gene onto comparable scales. Perhaps the simplest way to approximately achieve this goal is Z-score normalization. More advanced methods attempt to match data-set-specific parameters of the expression value distributions between input sets. Data integration methods that have been used in similar studies include Distance Weighted Discrimination (DWD), Combating Batch Effects (ComBat), disTran, Median Rank Score (MRS), Quantile Discretizing (QD) and Z-score transformation.

Another necessary processing step in data merging consists of mapping microarray features to a catalogue of standard gene names. This in turn defines the subset of common genes to be retained in the merged data set. Here, the term microarray feature refers to a single hybridization probe, or a set of probes, for which the platform returns a single expression value. Commercially available microarrays often contain multiple features for the same gene. What makes the merging of data sets non-trivial is that different platforms refer to the same genes by different names. Note further that, for the reasons outlined above, merging of data sets usually leads to a substantial reduction in the number of genes considered for downstream analysis.
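The mapping and intersection step can be illustrated with a minimal sketch. It assumes that each platform comes with a feature-to-gene lookup table, that multiple features for the same gene are collapsed by averaging (one possible choice, not one prescribed by the studies discussed here), and that only genes present in every data set are kept. All probe, clone and gene names are hypothetical toy values.

```python
from statistics import mean

def collapse_to_genes(feature_values, feature_to_gene):
    """Average all features that map to the same gene; drop unmapped features."""
    per_gene = {}
    for feature, value in feature_values.items():
        gene = feature_to_gene.get(feature)
        if gene is None:
            continue  # feature has no entry in the gene catalogue
        per_gene.setdefault(gene, []).append(value)
    return {gene: mean(values) for gene, values in per_gene.items()}

def common_genes(*gene_tables):
    """Return the sorted list of genes present in every data set."""
    shared = set(gene_tables[0])
    for table in gene_tables[1:]:
        shared &= set(table)
    return sorted(shared)

# Hypothetical toy data: one sample from each of two platforms.
platform_a = {"probe_1": 2.0, "probe_2": 4.0, "probe_3": 7.0}
map_a = {"probe_1": "GENE1", "probe_2": "GENE1", "probe_3": "GENE2"}
platform_b = {"clone_x": 5.0, "clone_y": 1.0, "clone_z": 3.0}
map_b = {"clone_x": "GENE1", "clone_y": "GENE3"}  # clone_z is unmapped

genes_a = collapse_to_genes(platform_a, map_a)
genes_b = collapse_to_genes(platform_b, map_b)
shared = common_genes(genes_a, genes_b)

# Merged table restricted to the common genes: gene -> (value in A, value in B)
merged = {gene: (genes_a[gene], genes_b[gene]) for gene in shared}
```

In this toy case each platform measures two genes, but only one gene survives the intersection, which mirrors the reduction in gene number described above.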
Important genes included in only a part of the input data sets may be lost. Some studies used UniGene IDs to identify common genes between different data sets, whereas other studies employed different databases, such as RefSeq or the Stanford SOURCE database, to match probes or probe sets to genes. Note further that some research teams directly used probe/clone identifiers or probe set IDs when merging only cDNA or only Affymetrix data set collections, respectively. The latter studies may have preferred not to collapse features into genes so as to keep the same annotation as other studies and thereby validate the same features. An additional reason to keep the original feature IDs is to preserve a large number of features, rather than a smaller number of genes, for making biological or statistical inferences. Sohal and coworkers used both UniGene IDs and RefSeq IDs to compare common genes. They concluded that UniGene IDs achieved slightly better results than RefSeq IDs. In this study, we used our own resource CleanEx, a database specifically developed for this purpose, to map microarray features to gene names.

Some research projects merged the gene expression values in their original continuous representation, whereas others first converted them into ranks. In the latter studies, ranking was used to predict a categorical outcome. Note that ranking methods replace the continuous values by discrete integer values, which influences the choice of data integration method. While DWD and ComBat preserve the original representation of the data, MRS, QD and disTran transform it into discrete values.
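The contrast between a continuous and a discrete representation can be made concrete with a small sketch. It is not an implementation of DWD, ComBat, MRS, QD or disTran; it only compares a plain Z-score transformation, which keeps continuous values, with a simple rank transformation, which replaces them by integers. The function names, the tie rule and the toy values (absolute intensities in one data set, log ratios in the other) are assumptions made for illustration, and which vector is transformed (one gene across samples, or all genes within a sample) is left open.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Centre values on their mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant input: no variation to scale
    return [(v - mu) / sigma for v in values]

def ranks(values):
    """Replace values by integer ranks (1 = smallest); tied values share the lowest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical measurements of the same quantity in two data sets
# that were pre-processed on very different scales.
set_a = [100.0, 200.0, 300.0]   # absolute values
set_b = [-1.0, 0.0, 1.0]        # log ratios

z_a, z_b = z_scores(set_a), z_scores(set_b)   # continuous, comparable scale
r_a, r_b = ranks(set_a), ranks(set_b)         # discrete integers
```

Both transformations bring the two toy data sets onto a common scale, but only the Z-scores remain continuous. The ranks depend solely on the ordering of the values, so any strictly increasing rescaling of the input leaves them unchanged.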