Combining Pareto-Optimal Clusters using Supervised Learning for Identifying Co-expressed Genes

 

(Accepted in BMC Bioinformatics)

 

Ujjwal Maulik

Department of Computer Science & Engineering, Jadavpur University, Kolkata 700032, India, drumaulik@cse.jdvu.ac.in

 

Anirban Mukhopadhyay

Department of Computer Science & Engineering, University of Kalyani, Kalyani-741235, India, anirban@klyuniv.ac.in

 

Sanghamitra Bandyopadhyay

Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India, sanghami@isical.ac.in

 

 

DATA SETS USED FOR EXPERIMENTS

 

Yeast Sporulation

This data set consists of 6118 genes measured across 7 time points (0, 0.5, 2, 5, 7, 9 and 11.5 hours) during the sporulation process of budding yeast. The data are log-transformed. The Sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation. Of the 6118 genes, those whose expression levels did not change significantly during harvesting have been excluded from further analysis. Significance is determined with a threshold of 1.6 on the root mean square of the log2-transformed ratios. The resulting set consists of 474 genes. The preprocessed and normalized data set is available here.
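The root-mean-square filter described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the dictionary layout, the gene labels, and the choice to keep genes whose RMS is strictly greater than the threshold are all assumptions.

```python
import math


def rms(values):
    """Root mean square of a sequence of log2-transformed ratios."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def filter_by_rms(log_ratios, threshold=1.6):
    """Keep genes whose RMS log2 ratio exceeds the threshold.

    log_ratios maps a gene label to its list of log2 ratios
    (hypothetical layout). The strict '>' comparison is an
    assumption; the source only states the threshold value.
    """
    return {
        gene: profile
        for gene, profile in log_ratios.items()
        if rms(profile) > threshold
    }


# Hypothetical profiles over 7 time points, for illustration only.
example = {
    "geneA": [2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 2.0],  # RMS = 2.0, kept
    "geneB": [0.1, 0.2, -0.1, 0.0, 0.1, -0.2, 0.1],   # RMS well below 1.6, dropped
}
selected = filter_by_rms(example)
```

Genes with flat profiles have an RMS close to zero and are discarded, so only genes with a substantial overall change in expression remain.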

 

Yeast Cell Cycle

The yeast cell cycle data set was extracted from a larger data set that records the fluctuation of expression levels of approximately 6000 genes over two cell cycles (17 time points). Of these 6000 genes, 384 that were identified as cell-cycle regulated have been selected. This data set is publicly available at http://faculty.washington.edu/kayee/cluster. The preprocessed and normalized data set is available here.

 

 

Arabidopsis Thaliana

This data set consists of the expression levels of 138 genes of Arabidopsis thaliana, measured over 8 time points: 15 min, 30 min, 60 min, 90 min, 3 hours, 6 hours, 9 hours, and 24 hours. It is available at http://homes.esat.kuleuven.be/_thijs/Work/Clustering.html. The preprocessed and normalized data set is available here.

 

Human Fibroblasts Serum

This data set contains the expression levels of 8613 human genes. It has 13 dimensions, corresponding to 12 time points (0, 0.25, 0.5, 1, 2, 4, 6, 8, 12, 16, 20 and 24 hours) and one unsynchronized sample. A subset of 517 genes whose expression levels changed substantially across the time points has been chosen. The data are then log2-transformed. This data set can be downloaded from http://www.sciencemag.org/feature/data/984559.shl. The preprocessed and normalized data set is available here.

 

 

Rat CNS

The Rat CNS data set was obtained by reverse transcription-coupled PCR, which was used to examine the expression levels of 112 genes during rat central nervous system development over 9 time points. This data set is available at http://faculty.washington.edu/kayee/cluster. The preprocessed and normalized data set is available here.

 

All the data sets are normalized so that each row has mean 0 and variance 1.
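
The row-wise normalization can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name is hypothetical, each row is taken to be one gene's expression profile, the population variance (dividing by the number of time points) is used, and constant rows are mapped to zeros to avoid division by zero.

```python
import math


def normalize_rows(matrix):
    """Rescale each row to mean 0 and variance 1.

    matrix is a list of rows, each a list of expression values.
    Population variance (divide by n) is an assumption; the source
    does not say whether n or n - 1 is used.
    """
    normalized = []
    for row in matrix:
        n = len(row)
        mean = sum(row) / n
        variance = sum((x - mean) ** 2 for x in row) / n
        if variance == 0:
            # Constant row: no variation to scale, so return zeros (assumption).
            normalized.append([0.0] * n)
            continue
        std = math.sqrt(variance)
        normalized.append([(x - mean) / std for x in row])
    return normalized


# Hypothetical 2 x 3 expression matrix, for illustration only.
normalized_example = normalize_rows([[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]])
```

After this step, profiles that differ only in scale or offset become identical, so a distance measure compares the shape of the expression patterns rather than their absolute levels.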