Systematic Variation Normalization (SVN)

Overview

Systematic variation normalization (SVN) is a procedure for removing systematic variation from microarray gene expression data. Based on an analysis of how systematic variation contributes to variability in microarray data sets, the SVN procedure comprises background subtraction (with the background determined from the distribution of pixel intensity values) followed by log conversion, linear or non-linear regression, restoration or transformation, and multiarray normalization.

When non-linear regression is required, an empirical polynomial approximation is used. Either the high terminated points in the distributions of the pixel intensity values observed in the control channels, or their averaged values, may be used for rescaling multiarray datasets. These pre-processing steps remove systematic variation attributable to variability in microarray slides, assay batches, the array process, or experimenters. Comparisons of gene expression patterns between control and test channels, or among multiple arrays, are therefore unbiased and biologically meaningful when they are made on the normalized datasets.

Method

The first step in the normalization is to subtract the background $b$ from the measured intensity $I_i$, i.e.

$$I_{i,\mathrm{b.sub}} = I_i - b = a\,e^{c G_i}, \tag{1}$$

where $I_{i,\mathrm{b.sub}}$ is the intensity after subtraction of the total background $b$. Taking the logarithm of both sides of Eq. (1) gives

$$\log I_{i,\mathrm{b.sub}} = \log a + c\,G_i. \tag{2}$$
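A minimal sketch of Eqs. (1) and (2) in Python with NumPy may help make the step concrete. The function name `background_subtract_log` is hypothetical, the background `b` is taken as a given number (the source determines it from the pixel intensity distribution, which is not reproduced here), and marking spots at or below background as NaN is an assumption of this sketch, not something the source prescribes.

```python
import numpy as np

def background_subtract_log(intensity, background):
    """Return log(I_i - b) per spot; spots at or below background become NaN.

    Hypothetical helper: the NaN handling is an assumption of this sketch.
    """
    intensity = np.asarray(intensity, dtype=float)
    subtracted = intensity - background          # Eq. (1): I_i - b
    out = np.full(intensity.shape, np.nan)
    positive = subtracted > 0                    # log is only defined for positive values
    out[positive] = np.log(subtracted[positive]) # Eq. (2): log a + c * G_i
    return out

# Synthetic data generated from the model I_i = b + a * exp(c * G_i)
a, b, c = 50.0, 120.0, 0.8
G = np.array([1.0, 2.0, 3.0, 4.0])
I = b + a * np.exp(c * G)
log_sub = background_subtract_log(I, b)
```

On the synthetic data, `log_sub` is a straight line in `G` with intercept `log(a)` and slope `c`, which is the linear relationship that the rest of the procedure relies on.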

If $a$ and $c$ are constants, $\log I_{i,\mathrm{b.sub}}$ is a linear function of the gene expression $G_i$. In a two-channel cDNA or oligo microarray experiment, total or poly(A) RNA samples for the control and treated channels are labeled with either Cy3 or Cy5, mixed, competitively hybridized to the array, and excited by different incident laser beams. The data from the two channels carry different systematic variations, so two equations are needed to represent the two channel data sets:

$$\log I_{i,\mathrm{b.sub},t} = \log a_t + c_t\,G_{i,t}, \tag{3}$$

$$\log I_{i,\mathrm{b.sub},c} = \log a_c + c_c\,G_{i,c}, \tag{4}$$

where the subscripts $t$ and $c$ denote the treated and control channels, respectively. Although the examples presented are from two-channel microarray data sets, Eqs. (3) and (4) are independent of each other. In principle, therefore, the procedure is also applicable to one-channel data sets, as long as the two one-channel data sets form a biologically meaningful pair. Because the systematic variation terms $a$ and $c$ are arbitrary constants, a linear transformation applied to either Eq. (3) or Eq. (4) yields a data set that is equivalent with respect to the gene expression $G_i$ it carries. Subtracting Eq. (4) from Eq. (3) yields

$$\log I_{i,\mathrm{b.sub},t} - \log I_{i,\mathrm{b.sub},c} = \log a_t - \log a_c + c_t\,G_{i,t} - c_c\,G_{i,c}. \tag{5}$$

Plotting the two data sets $\log I_{i,\mathrm{b.sub},t}$ and $\log I_{i,\mathrm{b.sub},c}$ of Eqs. (3) and (4) against each other gives a 2-dimensional (2-D) scatter representation of the data. If both $a$ and $c$ are constants, a linear regression can be applied to the data in this 2-D scatter representation. The linear regression provides two conditions,

$$\log a_t - \log a_c = 0, \tag{6}$$

$$c_t / c_c = 1. \tag{7}$$

Using these two conditions as adjudicators (reference points), we apply linear transformations to both data sets such that a new regression line over the transformed data passes through the origin of the 2-D scatter plot with a slope of one. Combined with the conditions of Eqs. (6) and (7), Eq. (5) becomes

$$\log I_{i,\mathrm{b.sub},t} - \log I_{i,\mathrm{b.sub},c} = c\,(G_{i,t} - G_{i,c}) = c\,\Delta G_i, \tag{8}$$

where $a$ cancels (see Eq. (6)) and the coefficient $c = c_t = c_c$ (see Eq. (7)). $\Delta G_i = G_{i,t} - G_{i,c}$ represents the change in expression of gene $i$ between the test and control samples. The left side of Eq. (8) is mathematically equal to $\log (I_{i,\mathrm{b.sub},t} / I_{i,\mathrm{b.sub},c})$, the so-called log ratio. However, taking the intensity ratio between the two channels first and then applying the logarithm yields a questionable biological representation. As the above procedure shows, ratio-based normalization approaches do not treat the systematic variation terms $a$ and $c$ individually, so the resulting gene expression values are still biased.
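The linear case of Eqs. (3) through (8) can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: the function name `linear_svn` is hypothetical, and the sketch transforms only the treated channel so that the refitted line has slope one and intercept zero, whereas the text describes linear transformations applied to both data sets.

```python
import numpy as np

def linear_svn(log_control, log_treated):
    """Hypothetical sketch: fit log_treated = intercept + slope * log_control,
    then transform the treated channel so the new regression line passes
    through the origin with slope one (conditions of Eqs. (6) and (7))."""
    x = np.asarray(log_control, dtype=float)
    y = np.asarray(log_treated, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line, highest power first
    y_adj = (y - intercept) / slope          # assumption: only this channel is transformed
    delta = y_adj - x                        # left side of Eq. (8), proportional to delta G_i
    return y_adj, delta

# Synthetic two-channel data with different a and c in each channel
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 5.0, 200)
x = np.log(20.0) + 1.0 * G + rng.normal(0.0, 0.05, G.size)   # control channel, Eq. (4)
y = np.log(35.0) + 1.3 * G + rng.normal(0.0, 0.05, G.size)   # treated channel, Eq. (3)
y_adj, delta = linear_svn(x, y)
```

Refitting a line to `y_adj` against `x` returns a slope of one and an intercept of zero up to floating-point error, so `delta` no longer contains the channel-specific terms.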

Eqs. (1) through (8) constitute the proposed two-channel array normalization procedure. As Eq. (8) shows, the resulting data still carry an arbitrary systematic variation component $c$. When there are multiple two-channel microarray data sets and each has a different $c$, multiarray normalization rescales each $c$ to a common constant $C$ so that the data sets become comparable. Among the multiarray data sets, one common and arbitrary systematic variation component $C$ then remains. This arbitrary constant represents the conversion relationship between the measured log intensity ($\log I_{i,\mathrm{b.sub},t}$ or $\log I_{i,\mathrm{b.sub},c}$) and the gene expression ($G_{i,t}$ or $G_{i,c}$).
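One way the rescaling to a common constant could look is sketched below in Python with NumPy. The details are assumptions of this sketch rather than the published procedure: the function names are hypothetical, the reference for each array is taken as the mean of the top fraction of control-channel log intensities (one reading of the averaged high terminated points mentioned in the overview), the additive term is assumed to have been removed already so that log intensities scale with $c$, and all values are assumed finite.

```python
import numpy as np

def high_end_reference(log_control, top_fraction=0.01):
    """Mean of the highest control-channel log intensities (assumed reference)."""
    values = np.sort(np.asarray(log_control, dtype=float))
    k = max(1, int(round(top_fraction * values.size)))
    return values[-k:].mean()

def rescale_arrays(arrays, top_fraction=0.01, target=None):
    """Hypothetical sketch: scale both channels of each array so that every
    array shares the same high-end control reference.

    arrays: list of (log_control, log_treated) pairs, one pair per array.
    """
    refs = [high_end_reference(ctrl, top_fraction) for ctrl, _ in arrays]
    if target is None:
        target = float(np.mean(refs))        # arbitrary common constant
    rescaled = []
    for (ctrl, trt), ref in zip(arrays, refs):
        factor = target / ref                # plays the role of C / c for this array
        rescaled.append((np.asarray(ctrl, dtype=float) * factor,
                         np.asarray(trt, dtype=float) * factor))
    return rescaled

# Two synthetic arrays measuring the same expression with c = 1 and c = 2
G = np.linspace(1.0, 10.0, 200)
arrays = [(1.0 * G, 1.0 * (G + 0.5)), (2.0 * G, 2.0 * (G + 0.5))]
rescaled = rescale_arrays(arrays)
```

After rescaling, the two synthetic arrays carry the same factor in front of `G`, so their treated-minus-control differences can be compared directly.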

As indicated above, linear regression assumes that the systematic variation factors $a$, $b$ and $c$ are constants. For most of the microarray data sets we have analyzed, linear regression is adequate. However, some microarray data sets show a non-linear distribution in the 2-D scatter representation of $\log I_{i,\mathrm{b.sub},t}$ versus $\log I_{i,\mathrm{b.sub},c}$. In a well-calibrated detection system, if the exponential dependence $F_{i,g} = e^{c G_i}$, where $c$ is the systematic variation and $G$ represents gene expression, is the proper relationship, then the non-linear distribution is predominantly caused by the sample-related systematic variation $c$, i.e.

$$c = c(G_i). \tag{9}$$

In other words, the systematic variation $c$ is a function of the gene expression $G_i$. If this dependence is known, it can be used directly in the normalization procedure. Otherwise, an empirical relationship may be used. A few intensity-dependent approaches have been proposed in the literature. In our data analysis practice, we find that a simple polynomial approach is satisfactory for most of the non-linearly distributed data sets we have encountered. Starting from Eq. (9), we may expand $c$ in powers of $(G - G_{\mathrm{ave}})$, assuming the expansion converges, where $G_{\mathrm{ave}}$ is the average of $G_i$ over all the samples spotted on an array. After regrouping the constants in the expansion, the polynomial expansion can be expressed as follows,

$$c = c_0 + c_1 G + c_2 G^2 + \dots, \tag{10}$$

where $c_i$ are constants and $i = 0, 1, 2, \dots$. If all $c_i$ of the higher-order terms (i.e. $i \ge 2$) equal zero, we are back to the linear case. In practice, depending on the distribution of a data set, we may take the first few terms as a polynomial approximation in the regression. The data from the two channels are then transformed according to the regression.
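A polynomial regression of this kind might be sketched as follows in Python with NumPy. The function name `polynomial_svn`, the default degree of 2, and the specific transformation (subtracting the fitted curve's departure from the identity line from the treated channel) are assumptions of this sketch; the text does not spell out the exact transformation.

```python
import numpy as np

def polynomial_svn(log_control, log_treated, degree=2):
    """Hypothetical sketch: fit a low-order polynomial to the 2-D scatter and
    remove its departure from the identity line from the treated channel."""
    x = np.asarray(log_control, dtype=float)
    y = np.asarray(log_treated, dtype=float)
    coeffs = np.polyfit(x, y, degree)        # first few terms of the expansion
    fitted = np.polyval(coeffs, x)
    y_adj = y - (fitted - x)                 # assumed transformation; residuals are preserved
    return y_adj, coeffs

# Synthetic scatter with a mild curvature between the channels
rng = np.random.default_rng(1)
x = rng.uniform(4.0, 12.0, 300)
y = 0.5 + 0.8 * x + 0.02 * x**2 + rng.normal(0.0, 0.05, x.size)
y_adj, coeffs = polynomial_svn(x, y)
```

With `degree=1` the fitted curve is a straight line and the sketch also brings the refitted line to slope one through the origin, although by a different transformation than rescaling by the slope. The per-gene residuals around the fitted curve are left unchanged, so differences between channels that the curve does not explain are kept.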

Reference

Chou JW, Paules RS, Bushel PR. Systematic variation normalization in microarray data to get gene expression comparison unbiased. J Bioinform Comput Biol. 2005 Apr;3(2):225-41. PMID: 15852502

Requirements

The software has been tested on PCs running the Windows 2000 and Windows XP operating systems and requires JRE version 1.4.2.

Downloads

Download the SVN archive (11MB)
Quick Start Guide
Report bugs, corrections and suggestions to the contact listed below.

Public Domain Notice

This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.

Contact

Robert P. Bushel, Ph.D. Special Volunteer: Tel 919-618-1945
[email protected]

National Institute of Environmental Health Sciences
