Combining Pareto-Optimal Clusters using Supervised Learning for Identifying Co-expressed Genes
(Accepted in BMC Bioinformatics)

Ujjwal Maulik, Department of Computer Science & Engineering, Jadavpur University, Kolkata 700032, India, drumaulik@cse.jdvu.ac.in

Anirban Mukhopadhyay, Department of Computer Science & Engineering, University of Kalyani, Kalyani-741235, India, anirban@klyuniv.ac.in

Sanghamitra Bandyopadhyay, Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India, sanghami@isical.ac.in
DATA SETS USED FOR EXPERIMENTS

Yeast Sporulation

This data set consists of 6118 genes measured across 7 time points (0, 0.5, 2, 5,
7, 9 and 11.5 hours) during the sporulation process of budding yeast. The
data are then log-transformed. The Sporulation data set is publicly available
at http://cmgm.stanford.edu/pbrown/sporulation.
Among the 6118 genes, those whose expression levels did not change
significantly during the harvesting have been excluded from further analysis.
This is determined with a threshold of 1.6 for the root mean square of
the log2-transformed ratios. The resulting set consists of 474 genes. The
preprocessed and normalized data set is available here.

Yeast Cell Cycle

The
yeast cell cycle dataset was extracted from a dataset that shows the
fluctuation of expression levels of approximately 6000 genes over two cell
cycles (17 time points). Out of these 6000 genes, 384 genes that are
cell-cycle regulated have been selected. This data set is publicly available at
http://faculty.washington.edu/kayee/cluster.
The preprocessed and normalized data set is available here.

Arabidopsis Thaliana

This
data set consists of expression levels of 138 genes of Arabidopsis Thaliana.
It contains expression levels of the genes over 8 time points, viz., 15 min, 30 min, 60 min, 90 min, 3 hours, 6
hours, 9 hours, and 24 hours. It is available at http://homes.esat.kuleuven.be/_thijs/Work/Clustering.html. The
preprocessed and normalized data set is available here.

Human Fibroblasts Serum

This
dataset contains the expression levels of 8613 human genes. The data set has
13 dimensions corresponding to 12 time points (0, 0.25, 0.5, 1, 2, 4, 6, 8,
12, 16, 20 and 24 hours) and one unsynchronized sample. A subset of 517 genes
whose expression levels changed substantially across the time points has
been chosen. The data are then log2-transformed. This data set can be
downloaded from http://www.sciencemag.org/feature/data/984559.shl.
The preprocessed and normalized data set is available here.

Rat CNS

The
Rat CNS data set has been obtained by reverse transcription-coupled PCR to
examine the expression levels of a set of 112 genes during rat central
nervous system development over 9 time points. This data set is available at http://faculty.washington.edu/kayee/cluster. The preprocessed and normalized data set is available here.

All the data sets are normalized so that each row has mean 0 and variance 1.
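
The two preprocessing steps described above, the root-mean-square filter applied to the Sporulation data and the row-wise normalization applied to all data sets, can be sketched in Python with NumPy. This is a minimal illustration and not the authors' code. The function names `filter_by_rms` and `normalize_rows` are hypothetical, and two details are assumptions because the text does not state them: genes are kept when their RMS is strictly greater than the threshold, and the variance is the population variance (divisor n rather than n - 1).

```python
import numpy as np


def filter_by_rms(log_ratios, threshold=1.6):
    """Keep genes (rows) whose root mean square over time points exceeds threshold.

    Assumption: the comparison is strict (rms > threshold); the text only
    gives the threshold value of 1.6.
    """
    log_ratios = np.asarray(log_ratios, dtype=float)
    rms = np.sqrt(np.mean(log_ratios ** 2, axis=1))
    keep = rms > threshold
    return log_ratios[keep], keep


def normalize_rows(data):
    """Scale each row to mean 0 and variance 1.

    Assumption: population variance (ddof=0) is used.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=1, keepdims=True)
    std = data.std(axis=1, keepdims=True)
    if np.any(std == 0):
        # A constant row has no variance and cannot be rescaled.
        raise ValueError("cannot normalize a row with zero variance")
    return (data - mean) / std
```

For a genes-by-time-points matrix of log2 ratios, one would call `filter_by_rms` first and then pass the retained rows to `normalize_rows`. The returned boolean mask makes it possible to recover which gene identifiers were kept.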