SA-Modk-Prototypes for Simultaneous Clustering of Gene Expression Data with Clinical Chemistry and Pathological Evaluations using Simulated Annealing

Overview

The SA-Modk-prototypes algorithm clusters biological samples by considering microarray gene expression data simultaneously with classes of known phenotypic variables, such as clinical chemistry evaluations and histopathologic observations. To measure the dissimilarity of samples, it constructs an objective function that sums the squared Euclidean distances for the numeric microarray and clinical chemistry data and a simple matching measure for the categorical histopathology values. Simulated annealing is used to avoid local minima in the search for the global solution. Separate weighting terms for the microarray, clinical chemistry and histopathology measurements control the influence of each data domain on the clustering of the samples. To determine the number of clusters in a data set, the dynamic validity index for numeric data was modified with a category utility measure. A cluster’s prototype, formed from the mean of the numeric feature values and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members.
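
The dissimilarity measure and the cluster prototype described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from the Matlab distribution. The function names, the dictionary keys ("expr", "chem", "histo") and the way the three weights enter the sum (a plain weighted sum) are assumptions made for illustration. The simulated annealing search and the cluster-number index are not shown.

```python
from collections import Counter
from statistics import mean


def mixed_dissimilarity(sample, prototype, weights):
    """Weighted dissimilarity between one sample and one cluster prototype.

    `sample` and `prototype` are dicts with keys "expr" and "chem" (lists of
    floats) and "histo" (list of category labels). `weights` maps the same
    three keys to non-negative floats. All names are hypothetical.
    """
    # Squared Euclidean distance for the two numeric domains.
    d_expr = sum((x - p) ** 2 for x, p in zip(sample["expr"], prototype["expr"]))
    d_chem = sum((x - p) ** 2 for x, p in zip(sample["chem"], prototype["chem"]))
    # Simple matching for the categorical domain: count the mismatches.
    d_histo = sum(x != p for x, p in zip(sample["histo"], prototype["histo"]))
    # Assumption: the three domain weights combine as a plain weighted sum.
    return (weights["expr"] * d_expr
            + weights["chem"] * d_chem
            + weights["histo"] * d_histo)


def cluster_prototype(members):
    """Prototype of a cluster: per-feature mean (numeric) and mode (categorical)."""
    def column_means(key):
        return [mean(col) for col in zip(*(m[key] for m in members))]

    def column_modes(key):
        # Counter.most_common breaks ties in favour of the value seen first.
        return [Counter(col).most_common(1)[0][0]
                for col in zip(*(m[key] for m in members))]

    return {"expr": column_means("expr"),
            "chem": column_means("chem"),
            "histo": column_modes("histo")}


def objective(samples, labels, weights):
    """Sum of each sample's dissimilarity to the prototype of its own cluster."""
    clusters = {}
    for sample, label in zip(samples, labels):
        clusters.setdefault(label, []).append(sample)
    prototypes = {k: cluster_prototype(ms) for k, ms in clusters.items()}
    return sum(mixed_dissimilarity(s, prototypes[k], weights)
               for s, k in zip(samples, labels))
```

Under these assumptions, raising one domain's weight makes mismatches in that domain cost more, so cluster assignments follow that domain more closely. A search procedure such as simulated annealing would propose changes to `labels` and compare the resulting values of `objective`.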

Reference for Citing

Bushel PR. Clustering of Gene Expression Data and End-point Measurements by Simulated Annealing. Journal of Bioinformatics and Computational Biology, 2008.

Data Types and Format

The input data need to be formatted (short and wide) in a tab-delimited text file, with array observations as rows and gene, clinical chemistry and histopathology variables as columns. The first row is the column header, and the second row holds an integer denoting the data type of each column (1 = gene expression, 2 = clinical chemistry measurement, 3 = histopathology observation). The columns should be ordered from data type 3 to 2 to 1, with each data type kept together in its own group or block.
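
As an illustration of this layout, the following minimal Python sketch (not part of the Matlab distribution) reads such a file and checks the column ordering. It assumes that every column is a feature column, with no separate sample-identifier column, which the description above does not state. The function name, the dictionary keys and the column names in the small demo table are hypothetical.

```python
import csv
import io

TYPE_NAMES = {1: "expr", 2: "chem", 3: "histo"}  # hypothetical short names


def read_mixed_table(handle):
    """Parse a tab-delimited table: header row, data-type row, one row per sample.

    Assumption: every column is a feature column (no sample-ID column).
    """
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header, type_row, data = rows[0], rows[1], rows[2:]
    types = [int(t) for t in type_row]
    if len(types) != len(header):
        raise ValueError("type row and header row differ in length")
    if any(t not in TYPE_NAMES for t in types):
        raise ValueError("data types must be 1, 2 or 3")
    # Columns are expected in blocks ordered 3, then 2, then 1, so the
    # type codes must never increase from left to right.
    if any(a < b for a, b in zip(types, types[1:])):
        raise ValueError("columns must be ordered by data type: 3, then 2, then 1")

    samples = []
    for row in data:
        if len(row) != len(header):
            raise ValueError("data row and header row differ in length")
        sample = {"expr": [], "chem": [], "histo": []}
        for value, t in zip(row, types):
            # Histopathology values stay as category labels; the rest are numeric.
            sample[TYPE_NAMES[t]].append(value if t == 3 else float(value))
        samples.append(sample)
    return header, types, samples


# Made-up demo table: one histopathology column, one clinical chemistry
# column and two gene columns, in the 3, 2, 1 order.
demo = (
    "Necrosis\tALT\tGeneA\tGeneB\n"
    "3\t2\t1\t1\n"
    "mild\t55.0\t0.1\t-1.2\n"
    "none\t30.5\t0.4\t0.9\n"
)
header, types, samples = read_mixed_table(io.StringIO(demo))
```

Checking that the type codes never increase from left to right covers both requirements at once: the 3, 2, 1 ordering and the grouping of each data type into a single block.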

Limitations

Only one categorical feature value per observation is permitted. A feature can belong to only one data type. The application is optimized for clustering the samples, not the genes, and for identifying phenotypic prototypes from the resulting groups. The application is not guaranteed to find the optimal clustering of the samples; it assigns the samples to clusters by reducing an objective function to a value close to the global minimum.

Requirements

SA-Modk-Prototypes is a set of Matlab functions and scripts tested in Matlab version 7.0.4.X.XX R14 for Windows (2000 and XP). You may encounter problems with other operating systems, platforms and/or Matlab versions. The application requires the Matlab Statistics Toolbox Version 4.0; the Resampling Stats Toolbox Version 1.0 by Daniel T. Kaplan (Department of Mathematics and Computer Science, Macalester College, St. Paul, Minnesota, USA); the adjusted Rand Index function by Tijl De Bie (February 2003); and two functions available at the Matlab Central File Exchange: loadcell.m, which loads mixed-type data, and cell2csv.m, which converts cell arrays to comma-separated value files (File IDs 1965 and 7601, respectively). Be sure to set the path of the Toolboxes in Matlab before running the application.

Downloads

Download the Matlab files and a stand-alone executable version of the program (101 MB). A demo script, ReadMe file and sample data are provided in the distribution to help you get started with the application.

Public Domain Notice

This is a U.S. government work. Under 17 U.S.C. 105, no copyright is claimed, and it may be freely distributed and copied.

Contact

Report bugs, corrections and suggestions to:

Pierre R. Bushel, Ph.D.
Special Volunteer
Tel 919-618-1945
[email protected]