Systematic Variation Normalization (SVN)
Overview
Systematic variation normalization (SVN) is a procedure for removing systematic variation from microarray gene expression data. Based on an analysis of how systematic variation contributes to variability in microarray data sets, the SVN procedure consists of background subtraction (with the background determined from the distribution of pixel intensity values), log conversion, linear or nonlinear regression, restoration or transformation, and multiarray normalization.
When nonlinear regression is required, an empirical polynomial approximation is used. Either the high terminated points in the distributions of pixel intensity values observed in the control channels, or their averaged values, may be used for rescaling multiarray data sets. These preprocessing steps remove systematic variation attributable to variability in microarray slides, assay batches, the array process, or experimenters. Biologically meaningful comparisons of gene expression patterns between control and test channels, or among multiple arrays, are therefore unbiased when made on the normalized data sets.
Method
The first step in the normalization is to subtract the total background $b$ from the measured intensity $I_i$, i.e.
$$ I_{i,\mathrm{b.sub}} = I_i - b = a\,e^{c G_i}, \tag{1} $$
where $I_{i,\mathrm{b.sub}}$ is the background-subtracted intensity. Taking the logarithm of both sides of Eq. (1) gives
$$ \log(I_{i,\mathrm{b.sub}}) = \log a + c\,G_i. \tag{2} $$
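As a minimal sketch of Eqs. (1) and (2), the Python snippet below subtracts a given background b and takes the natural logarithm. The function name, the synthetic values of a, b and c, and the choice to mark spots at or below background as NaN are assumptions made for illustration only; they are not taken from the SVN software, and the estimation of b from the pixel intensity distribution is not shown.

```python
import numpy as np

def log_background_subtracted(intensity, background):
    """Return log(I_i - b) per spot; spots with I_i <= b become NaN (an assumption)."""
    diff = np.asarray(intensity, dtype=float) - background
    out = np.full(diff.shape, np.nan)
    positive = diff > 0
    out[positive] = np.log(diff[positive])
    return out

# Synthetic data built directly from Eq. (1) with made-up constants.
a, b, c = 50.0, 120.0, 0.8
G = np.array([0.0, 1.0, 2.0, 3.0])      # hypothetical gene expression values
observed = b + a * np.exp(c * G)        # I_i = b + a * exp(c * G_i)

log_I = log_background_subtracted(observed, b)

# With constant a and c, Eq. (2) is a straight line in G_i:
# the fitted slope recovers c and the fitted intercept recovers log a.
slope, intercept = np.polyfit(G, log_I, 1)
print(slope, intercept, np.log(a))
```

On real data G is unknown, so this fit cannot be performed directly; the synthetic check only illustrates why the log conversion makes the relationship linear.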
If $a$ and $c$ are constants, $\log(I_{i,\mathrm{b.sub}})$ is a linear function of the gene expression $G_i$. In a two-channel cDNA or oligo microarray experiment, total or poly(A) RNA samples from the control and treated channels are labeled with either Cy3 or Cy5, mixed, competitively hybridized to the array, and excited by different incident laser beams. The data from the two channels carry different systematic variations, so two equations are needed to represent the two channel data sets,
$$ \log(I_{i,\mathrm{b.sub},t}) = \log a_t + c_t\,G_{i,t}, \tag{3} $$
and
$$ \log(I_{i,\mathrm{b.sub},c}) = \log a_c + c_c\,G_{i,c}, \tag{4} $$
where subscripts $t$ and $c$ denote the treated and control channels, respectively. Although the examples presented are from two-channel microarray data sets, Eqs. (3) and (4) are independent of each other. In principle, therefore, the procedure also applies to one-channel data sets, as long as the two one-channel data sets form a biologically meaningful pair. Since the systematic variation components $a$ and $c$ are arbitrary constants, a linear transformation applied to either Eq. (3) or Eq. (4) yields a data set that is equivalent with respect to the gene expression $G_i$ it carries. Subtracting Eq. (4) from Eq. (3) yields
$$ \log(I_{i,\mathrm{b.sub},t}) - \log(I_{i,\mathrm{b.sub},c}) = \log a_t - \log a_c + c_t\,G_{i,t} - c_c\,G_{i,c}. \tag{5} $$
Plotting the two data sets $\log(I_{i,\mathrm{b.sub},t})$ and $\log(I_{i,\mathrm{b.sub},c})$ of Eqs. (3) and (4) against each other gives a two-dimensional (2D) scatter representation of the data. If both $a$ and $c$ are constants, we can apply a linear regression to the data in this 2D scatter representation. The linear regression provides two conditions,
$$ \log a_t - \log a_c = 0, \tag{6} $$
and
$$ c_t / c_c = 1. \tag{7} $$
Using these two conditions as adjudicators (reference points), we apply linear transformations to both data sets such that a new regression line over the transformed data passes through the origin of the 2D scatter plot with a slope of one. Combined with the conditions in Eqs. (6) and (7), Eq. (5) becomes
$$ \log(I_{i,\mathrm{b.sub},t}) - \log(I_{i,\mathrm{b.sub},c}) = c\,(G_{i,t} - G_{i,c}) = c\,\Delta G_i, \tag{8} $$
where $a$ cancels (see Eq. (6)) and the coefficient $c = c_t = c_c$ (see Eq. (7)). $\Delta G_i$ ($= G_{i,t} - G_{i,c}$) represents the change in expression of gene $i$ between the test and control samples. The left side of Eq. (8) is mathematically equal to $\log(I_{i,\mathrm{b.sub},t} / I_{i,\mathrm{b.sub},c})$, i.e. the so-called log ratio. However, taking the intensity ratio between the two channels first and then applying the log conversion gives a questionable biological representation. As the procedure above shows, ratio-based normalization approaches do not treat the systematic variation components $a$ and $c$ individually, so the resulting gene expression values remain biased.
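The linear step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SVN implementation: the treated channel is regressed on the control channel and only the treated channel is transformed, whereas the text speaks of transforming both data sets; the function name and the synthetic slope, offset and noise level are likewise hypothetical.

```python
import numpy as np

def linear_normalize(log_t, log_c):
    """Transform log_t so that its regression line on log_c has slope 1 and intercept 0."""
    log_t = np.asarray(log_t, dtype=float)
    log_c = np.asarray(log_c, dtype=float)
    slope, intercept = np.polyfit(log_c, log_t, 1)   # fit log_t = slope * log_c + intercept
    log_t_norm = (log_t - intercept) / slope          # assumption: adjust the treated channel only
    return log_t_norm, log_c

# Synthetic two-channel data with a made-up offset and slope between channels.
rng = np.random.default_rng(0)
log_c = rng.uniform(4.0, 12.0, size=500)
log_t = 0.4 + 1.3 * log_c + rng.normal(0.0, 0.2, size=500)

log_t_norm, log_c_norm = linear_normalize(log_t, log_c)

# Refitting the transformed data recovers the conditions of Eqs. (6) and (7).
new_slope, new_intercept = np.polyfit(log_c_norm, log_t_norm, 1)

# Left side of Eq. (8): the per-gene difference, equal to c * Delta G_i.
delta = log_t_norm - log_c_norm
print(new_slope, new_intercept, delta.mean())
```

Regressing the control channel on the treated channel instead would give a slightly different line on noisy data, so the choice of regression direction is itself an assumption of this sketch.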
Eqs. (1) to (8) define the proposed two-channel array normalization procedure. As shown in Eq. (8), the resulting data still carry an arbitrary systematic variation component $c$. When there are multiple two-channel microarray data sets and each data set has a different $c$, so-called multiarray normalization rescales each $c$ to a common constant $C$ to make the multiarray data sets comparable. One common, arbitrary systematic variation component $C$ then remains among the multiarray data sets. This arbitrary constant represents the conversion relationship between the measured log intensity ($\log(I_{i,\mathrm{b.sub},t})$ or $\log(I_{i,\mathrm{b.sub},c})$) and the gene expression ($G_{i,t}$ or $G_{i,c}$).
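The text does not spell out how each c is rescaled to C, so the following Python sketch shows only one plausible reading. It assumes that the offset log a has already been removed from each array, that a per-array reference value can be taken as the average of the few highest control-channel log intensities (loosely following the "high terminated points" mentioned in the Overview), and that each array is rescaled multiplicatively so its reference matches a common target. The function name, the top_k parameter and the synthetic data are all hypothetical.

```python
import numpy as np

def rescale_to_common_c(log_ratios, control_logs, top_k=3):
    """Rescale each array so its control-channel reference value matches a common target."""
    # Reference per array: mean of the top_k largest control-channel log intensities.
    refs = [np.sort(np.asarray(ctrl, dtype=float))[-top_k:].mean() for ctrl in control_logs]
    target = float(np.mean(refs))                     # common target, playing the role of C
    scaled = [np.asarray(lr, dtype=float) * (target / ref)
              for lr, ref in zip(log_ratios, refs)]
    return scaled, target

# Two synthetic arrays measuring the same biology with different c (1.0 and 2.0).
G_c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])             # hypothetical control expression
delta_G = np.array([0.5, -1.0, 0.0, 2.0, -0.5])       # hypothetical expression changes
c_values = [1.0, 2.0]
control_logs = [c * G_c for c in c_values]            # log a assumed already removed
log_ratios = [c * delta_G for c in c_values]          # Eq. (8) for each array

scaled, target = rescale_to_common_c(log_ratios, control_logs)
print(target)
print(scaled[0])
print(scaled[1])
```

In this idealized case the two rescaled arrays coincide, because the reference values are exactly proportional to c; on real data the agreement would only be approximate.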
As indicated above, linear regression assumes that the systematic variation factors $a$, $b$ and $c$ are constants. In most microarray data sets we have analyzed, linear regression is adequate. However, some microarray data sets show a nonlinear distribution in the 2D scatter representation of $\log(I_{i,\mathrm{b.sub},t})$ versus $\log(I_{i,\mathrm{b.sub},c})$. In a well-calibrated detection system, if the exponential dependence $F_{i,g} = e^{c G_i}$, where $c$ is the systematic variation and $G_i$ represents gene expression, is the proper relationship, the nonlinear distribution is predominantly caused by the sample-related systematic variation $c$, i.e.
$$ c = c(G_i). \tag{9} $$
In other words, the systematic variation $c$ is a function of the gene expression $G_i$. If we know this dependence, we can use it directly in the normalization procedure. Otherwise, we may use an empirical relationship. A few intensity-dependent approaches have been proposed in the literature. In our data analysis practice, we find that a simple polynomial approach is satisfactory for most of the nonlinearly distributed data sets we have encountered. Starting from Eq. (9), we may expand $c$ in powers of $(G - G_{\mathrm{ave}})$, assuming the expansion converges, where $G_{\mathrm{ave}}$ is the average of $G_i$ over all the samples spotted on an array. After regrouping the constants in the expansion, we can write the polynomial expansion as follows,
$$ c = c_0 + c_1 G + c_2 G^2 + \dots, \tag{10} $$
where the $c_i$ are constants and $i = 0, 1, 2, \dots$. If all $c_i$ of the high-order terms (i.e. $i \ge 2$) equal zero, we are back to the linear case. In practice, depending on the distribution of a data set, we may keep the first few terms as a polynomial approximation in the regression. The data from the two channels are then transformed according to the regression.
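A minimal Python sketch of the polynomial case is given below. The text does not state the exact transformation applied after the polynomial regression, so this sketch assumes one simple choice: fit a low-order polynomial of the treated channel on the control channel, subtract the fitted trend from the treated channel, and add back the control values so that the regression curve of the transformed data becomes the identity line. The function name, the default degree of 2 and the synthetic data are hypothetical.

```python
import numpy as np

def polynomial_normalize(log_t, log_c, degree=2):
    """Remove a fitted polynomial trend of log_t on log_c, leaving the identity line."""
    log_t = np.asarray(log_t, dtype=float)
    log_c = np.asarray(log_c, dtype=float)
    coeffs = np.polyfit(log_c, log_t, degree)   # empirical polynomial regression
    trend = np.polyval(coeffs, log_c)           # fitted treated value at each control value
    return log_t - trend + log_c                # assumption: subtractive correction

# Synthetic data with a made-up quadratic distortion between channels.
rng = np.random.default_rng(1)
log_c = rng.uniform(4.0, 12.0, size=500)
log_t = 0.5 + 0.9 * log_c + 0.03 * log_c**2 + rng.normal(0.0, 0.2, size=500)

log_t_norm = polynomial_normalize(log_t, log_c, degree=2)

# Refitting a quadratic to the transformed data gives coefficients near
# [0, 1, 0] (highest power first): no curvature, unit slope, zero intercept.
refit = np.polyfit(log_c, log_t_norm, 2)
print(refit)
```

The degree would in practice be chosen from the shape of the 2D scatter distribution, as the text suggests; setting degree=1 gives a linear fit with the same subtractive correction.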
Reference
Chou JW, Paules RS, Bushel PR. Systematic variation normalization in microarray data to get gene expression comparison unbiased. J Bioinform Comput Biol. 2005 Apr;3(2):225-41. PMID: 15852502
Requirements
The software has been tested on PCs running the Windows 2000 and Windows XP operating systems and requires JRE version 1.4.2.
Downloads
 Download the SVN archive (11 MB)
 Download the Quick Start Guide (45 KB)
 Report bugs, corrections and suggestions to chou@niehs.nih.gov
Public Domain Notice
This is a U.S. government work. Under 17 U.S.C. 105, no copyright is claimed and it may be freely distributed and copied.
Contact
 Pierre R. Bushel, Ph.D.

Tel (919) 316-4564
Fax (919) 541-4311
bushel@niehs.nih.gov
 Richard S. Paules, Ph.D.

Tel (919) 541-3710
Fax (301) 480-3182
paules@niehs.nih.gov