|
Combining Pareto-Optimal Clusters using Supervised Learning for Identifying Co-expressed Genes
(Accepted in BMC Bioinformatics)

Ujjwal Maulik, Department of Computer Science & Engineering, Jadavpur University, Kolkata 700032, India, drumaulik@cse.jdvu.ac.in
Anirban Mukhopadhyay, Department of Computer Science & Engineering, University of Kalyani, Kalyani-741235, India, anirban@klyuniv.ac.in
Sanghamitra Bandyopadhyay, Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India, sanghami@isical.ac.in |
|
ABSTRACT

Background
The invention of microarrays, which enable a simultaneous view of the
transcription levels of a huge number of genes across different
experimental conditions or time points, is rapidly changing the landscape of
biological and biomedical research. Clustering algorithms are widely applied
to microarray data sets in order to identify groups
of co-expressed genes. This article poses the problem of fuzzy clustering in
microarray data as a multiobjective optimization problem which simultaneously
optimizes two internal fuzzy cluster validity indices to yield a set of
Pareto-optimal clustering solutions. Each of these clustering solutions
possesses some amount of information regarding the clustering structure of
the input data. Motivated by this fact, a novel fuzzy majority voting
approach is proposed to combine the clustering information from all the
solutions in the resultant Pareto-optimal set. This approach first identifies
the genes which are assigned to some particular cluster with high membership
degree by most of the Pareto-optimal solutions. Using this set of genes as
the training set, the remaining genes are classified by a supervised learning
algorithm. In this work, we have used a Support Vector Machine (SVM)
classifier for this purpose.

Results
The performance of the proposed clustering technique has been demonstrated on five publicly available benchmark microarray data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis thaliana, Human Fibroblasts Serum and Rat Central Nervous System. Comparative studies covering different SVM kernels and several widely used microarray clustering techniques are reported. Moreover, statistical significance tests have been carried out to establish the statistical superiority of the proposed clustering approach. Finally, biological significance tests have been carried out using a web-based gene annotation tool to show that the proposed method is able to produce biologically relevant clusters of co-expressed genes.

Conclusions
The proposed clustering method has
been shown to perform better than other well-known clustering algorithms in
finding clusters of co-expressed genes efficiently. The clusters of genes
produced by the proposed technique are also found to be biologically
significant, i.e., they consist of genes which belong to the same functional
groups. This indicates that the proposed clustering method can be used
efficiently to identify co-expressed genes in microarray gene expression data.

Availability: http://anirbanmukhopadhyay.50webs.com/mogasvm.html
Contact: anirban@klyuniv.ac.in
Full text: The complete article is available here. |
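
The fuzzy majority voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that cluster labels have already been aligned across the Pareto-optimal solutions, and the function name `majority_vote_training_set` and the thresholds `alpha` (minimum membership degree) and `beta` (minimum fraction of agreeing solutions) are hypothetical names introduced here for illustration. Genes that pass the vote form the training set; the remaining genes would then be passed to a supervised classifier such as an SVM, which is not shown.

```python
from collections import Counter


def majority_vote_training_set(memberships, alpha=0.5, beta=0.5):
    """Select genes that most solutions assign to one cluster with high membership.

    memberships[s][g][k] is the fuzzy membership of gene g in cluster k under
    solution s (labels are assumed to be aligned across solutions).
    alpha and beta are illustrative thresholds, not values from the article.
    Returns (train, rest): train maps gene index -> voted cluster label,
    rest lists the genes left for a supervised classifier.
    """
    n_solutions = len(memberships)
    n_genes = len(memberships[0])
    train, rest = {}, []
    for g in range(n_genes):
        votes = Counter()
        for solution in memberships:
            row = solution[g]
            # Cluster with the highest membership for this gene in this solution.
            k = max(range(len(row)), key=row.__getitem__)
            # The solution votes only if that membership is high enough.
            if row[k] >= alpha:
                votes[k] += 1
        if votes:
            label, count = votes.most_common(1)[0]
            # Keep the gene only if more than a fraction beta of solutions agree.
            if count / n_solutions > beta:
                train[g] = label
                continue
        rest.append(g)
    return train, rest


# Toy data: 3 solutions, 3 genes, 2 clusters (made-up numbers).
memberships = [
    [[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]],
    [[0.80, 0.20], [0.30, 0.70], [0.40, 0.60]],
    [[0.95, 0.05], [0.10, 0.90], [0.50, 0.50]],
]
train, rest = majority_vote_training_set(memberships, alpha=0.7, beta=0.5)
print(train, rest)
```

In this toy run, the first two genes receive consistent high-membership votes and enter the training set, while the third gene is ambiguous in every solution and is left for the classifier.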