Algorithms Implemented in TAGster

TAGster: Efficient Selection of LD Tag SNP in Single or Multiple Populations

Consider a set S that contains M bi-allelic SNP markers α₁, α₂, ..., α_M in K populations, and S_i contains the M_i SNP markers s_i1, s_i2, ..., s_{iM_i} in population i. First, we estimate the pairwise LD measure r² for each SNP pair within each population. Two markers s_im and s_in are said to be in strong LD if r²(s_im, s_in) is greater than or equal to a pre-specified threshold value r₀. Each is then considered a tag SNP for the other, in that s_im can be used as a surrogate for s_in, or vice versa.
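The strong-LD test above can be illustrated with a short Python sketch. It assumes phased haplotypes coded 0/1 and uses the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)); how TAGster itself estimates r² is not stated here, and the function names and the default threshold of 0.8 are illustrative assumptions.

```python
from typing import Sequence


def r_squared(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """r^2 between two bi-allelic SNPs from phased haplotypes coded 0/1."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # A monomorphic SNP has no defined r^2; returning 0 is a convention
        # chosen for this sketch.
        return 0.0
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / denom


def in_strong_ld(hap_a: Sequence[int], hap_b: Sequence[int], r0: float = 0.8) -> bool:
    """True when r^2 meets the threshold r0 (0.8 is an assumed default)."""
    return r_squared(hap_a, hap_b) >= r0
```

Two SNPs whose alleles always co-occur, or never co-occur, give r² = 1 and therefore tag each other at any threshold.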

Our aim is to find a tag SNP set, denoted by T, such that ∀s_im ∈ S_i, i = 1,...,K, ∃α_j ∈ T that satisfies r²(α_j, s_im) ≥ r₀. In our presentation, we introduce intermediate SNP sets P_i and Q_i, i = 1,...,K, where P_i is called the candidate set and contains all the SNPs in population i that are eligible to be chosen as a tag SNP, and Q_i contains the SNPs in population i that are already tagged by at least one of the tag SNPs in T, i.e. ∀s_im ∈ Q_i, i = 1,...,K, ∃α_j ∈ T that satisfies r²(α_j, s_im) ≥ r₀. We implemented several algorithms in TAGster to select the tag SNP set T.
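The tagging condition on T can be checked directly. The following Python sketch is one way to do so, under assumptions that are not part of TAGster: strong-LD pairs are precomputed as symmetric neighbour sets per population, a SNP is taken to tag itself, and the function name and data layout are hypothetical.

```python
def untagged(snps_by_pop, strong_ld, tags):
    """Return, for each population i, the SNPs in S_i not tagged by any SNP in T.

    snps_by_pop: {population: iterable of SNP ids}            (the sets S_i)
    strong_ld:   {population: {snp: set of SNPs with r^2 >= r0}}
    tags:        iterable of SNP ids                          (the set T)
    """
    result = {}
    for pop, snps in snps_by_pop.items():
        snp_set = set(snps)
        neighbours = strong_ld.get(pop, {})
        tagged = set()                        # plays the role of Q_i
        for t in tags:
            if t in snp_set:                  # LD is only defined within a population
                tagged.add(t)                 # assumption: a SNP tags itself
                tagged |= neighbours.get(t, set()) & snp_set
        result[pop] = snp_set - tagged
    return result
```

T is a valid tag SNP set exactly when every returned set is empty.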

Algorithm 1: A greedy algorithm for single or multiple populations

1. Set T = ∅, P_i = S_i and Q_i = ∅, for i = 1,...,K;

2. For each SNP α_j in any candidate set P_i, calculate c_ij, the number of SNPs in population i in high LD with α_j:

   c_ij = number of SNPs s_im in population i with r²(α_j, s_im) ≥ r₀,   if α_j ∈ P_i
   c_ij = 0,   if α_j ∉ P_i

3. Find the SNP α_max that has the highest Σ_i c_ij, the total number of SNPs in high LD with it in all populations, and add α_max to T. If α_max ∈ P_i, add any SNP s_im in P_i with r²(α_max, s_im) ≥ r₀ to Q_i and then exclude α_max from P_i;

4. Repeat Steps 2-3 until Q_i = S_i for all i = 1,...,K.
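The greedy procedure of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not TAGster's implementation: strong-LD pairs are given as symmetric neighbour sets, a SNP tags itself, only SNPs not yet in Q_i are counted when scoring a candidate, and ties are broken by SNP name. The function name and data layout are hypothetical.

```python
def greedy_tags(snps_by_pop, strong_ld):
    """Greedy tag SNP selection across one or more populations."""
    all_snps = {pop: set(snps) for pop, snps in snps_by_pop.items()}   # S_i
    candidates = {pop: set(s) for pop, s in all_snps.items()}          # P_i
    tagged = {pop: set() for pop in all_snps}                          # Q_i
    tags = []                                                          # T

    def covered_by(snp, pop):
        # SNPs of this population in strong LD with snp, plus snp itself.
        return (strong_ld.get(pop, {}).get(snp, set()) | {snp}) & all_snps[pop]

    # Step 4: repeat until Q_i = S_i in every population.
    while any(tagged[pop] != all_snps[pop] for pop in all_snps):
        best, best_score = None, -1
        # Step 2: score each candidate, summing over populations.
        for snp in sorted(set().union(*candidates.values())):
            score = 0
            for pop in all_snps:
                if snp in candidates[pop]:
                    # Assumption: count only SNPs that are still untagged.
                    score += len(covered_by(snp, pop) - tagged[pop])
            if score > best_score:
                best, best_score = snp, score
        # Step 3: add the best SNP to T and update Q_i and P_i.
        tags.append(best)
        for pop in all_snps:
            if best in candidates[pop]:
                tagged[pop] |= covered_by(best, pop)
                candidates[pop].discard(best)
    return tags
```

Every untagged SNP remains a candidate and scores at least one for itself, so each pass tags at least one new SNP and the loop terminates.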

Algorithm 2: An optimal solution for single population tag SNP

An exhaustive search is performed within each population to find the minimal number of population-specific tag SNPs T_i, for i = 1,...,K.

1. Set T_i = ∅ and P_i=S_i, for i=1,...,K;

2. Within population i, partition the SNPs in P_i into disjoint precincts P_ij, j = 1,...,n, so that r²(s_im, s_in) < r₀ for any two SNPs s_im and s_in that belong to different precincts;

3. Within a precinct P_ij:

   (a) For any two SNPs s_im and s_in in precinct P_ij, if [formula missing], we exclude the one with the smaller [formula missing] from precinct P_ij;
   (b) Conduct an exhaustive search to find a minimum set of tag SNPs for the SNPs in precinct P_ij and add these tag SNPs to T_i;

4. Repeat Step 3 for each precinct.
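The precinct partition and the exhaustive search of Algorithm 2 can be sketched in Python as below. The sketch assumes that precincts are the connected components of the strong-LD graph (the finest partition that keeps r² < r₀ across precincts), that strong-LD pairs are given as symmetric neighbour sets, and that a SNP tags itself. The pairwise exclusion in Step 3 is omitted, and the function names are hypothetical.

```python
from itertools import combinations


def precincts(snps, neighbours):
    """Split SNPs into connected components of the strong-LD graph."""
    seen, parts = set(), []
    for start in snps:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                          # depth-first traversal
            s = stack.pop()
            if s in comp:
                continue
            comp.add(s)
            stack.extend(neighbours.get(s, set()) - comp)
        seen |= comp
        parts.append(comp)
    return parts


def minimum_tags(precinct, neighbours):
    """Smallest subset of the precinct that tags every SNP in it."""
    members = sorted(precinct)
    for size in range(1, len(members) + 1):   # try the smallest sets first
        for subset in combinations(members, size):
            covered = set(subset)
            for t in subset:
                covered |= neighbours.get(t, set()) & precinct
            if covered == precinct:
                return list(subset)
    return members
```

Because the search cost grows exponentially with precinct size, splitting into precincts first keeps each exhaustive search as small as possible.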

Algorithm 3: Two-stage solution for multi-populations

1. Conduct Algorithm 2 within each population to select a set of population specific tag SNPs T_i for i = 1,...,K;

2. Set T = ∅, P_i= S_i for i = 1,...,K;

3. For each SNP t_ij in T_i, find any SNPs s_im (s_im ∈ P_i and s_im ∉ T_i) that satisfy r²(t_ij, s_im) ≥ r₀, add them, together with t_ij, to LD bin B_ij, and exclude them from P_i;

4. Within each LD bin B_ij, set T_ij = ∅. Find any SNP s_im in B_ij that satisfies r²(s_im, s_in) ≥ r₀ for every SNP s_in in B_ij, and add s_im to T_ij;

5. Set [formula missing]. For each SNP τ_l in P, l = 1,...,|P|, construct a one-dimensional array A_l with K elements, where [formula missing];

6. Cluster the SNPs in P so that any two SNPs τ_m and τ_n in a cluster satisfy [formula missing];

7. Set Ψ = ∅. Find one SNP τ_l in each cluster with maximum [formula missing] and add it to Ψ;

8. Cluster the SNPs in Ψ so that any two SNPs τ_m and τ_n in a cluster satisfy [formula missing];

9. For each cluster, set LD bin set B= ∅, record the LD bins in each population that can be tagged by any SNP in the cluster to B, and then conduct an exhaustive search to find a minimum set of tag SNPs in the cluster that can tag all LD bins in B. Add this set of SNPs to T.
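Steps 3 and 4 of Algorithm 3, which build the LD bins and list the interchangeable tag SNPs in each bin, can be sketched in Python for a single population. The sketch assumes symmetric neighbour sets of strong-LD pairs and reads "for any SNP s_in in B_ij" as "for every SNP in the bin"; the function names and data layout are hypothetical, and Steps 5 to 9 are not covered.

```python
def build_bins(snps, neighbours, pop_tags):
    """Step 3: one LD bin B_ij per population-specific tag SNP t_ij."""
    remaining = set(snps)                     # P_i
    tag_set = set(pop_tags)                   # T_i
    bins = {}
    for t in pop_tags:
        linked = {
            s for s in neighbours.get(t, set())
            if s in remaining and s not in tag_set
        }
        bins[t] = linked | {t}
        remaining -= bins[t]                  # exclude binned SNPs from P_i
    return bins


def alternative_tags(bin_members, neighbours):
    """Step 4: SNPs in strong LD with every other SNP in the bin (T_ij)."""
    return {
        s for s in bin_members
        if bin_members - {s} <= neighbours.get(s, set())
    }
```

For a bin in which b and c are each linked only to a, the set T_ij contains a alone; if all three are mutually linked, any of them can serve as the tag.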

National Institute of Environmental Health Sciences
