Alu Pairs Database
Materials & Initial Data Sources
The Alu map file is a subfile of Repbase (Genetic Information Research Institute (GIRI), Sunnyvale, CA). It was derived by comparing genomic sequences in the GenBank database (release 112.0; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD) with the Alu consensus sequence. The file was made in April 1999 and is maintained at GIRI. The Alu map includes the following information in columnar form: locus, beginning sequence position, ending sequence position, type of Alu sequence, the fragment start relative to the repeat consensus, the fragment end relative to the repeat consensus, the orientation of the sequence (D, denoting direct, versus C, denoting complementary), the percent similarity to the Alu consensus sequence, the ratio of mismatches to matches, and the alignment score. The score column, which was not required for further manipulations, was removed. Also, to facilitate further processing, the name of each Alu subfamily was reduced to simply "Alu" by using the UNIX cut utility ($ cut -c1-30,40-75).
The program PLEN was used to calculate the length of each Alu sequence by subtracting column 2 from column 3 and adding 1. This length was pasted to the front of the map file so that the loci could be sorted by length. All Alu fragments less than 50 nucleotides long were excluded from the analysis. This left 151,695 Alu sequence fragments listed in a file named "alu".
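The length calculation and the 50-nucleotide cutoff can be illustrated with a minimal Python sketch. It is not the PLEN program itself; the record layout (locus, start, end) and the locus names in the example are assumptions made only for illustration.

```python
def fragment_length(start, end):
    """Inclusive fragment length: end position minus start position, plus 1."""
    return end - start + 1


def filter_by_length(records, min_length=50):
    """Prefix each (locus, start, end) record with its length,
    drop fragments shorter than min_length, and sort by length."""
    kept = []
    for locus, start, end in records:
        length = fragment_length(start, end)
        if length >= min_length:
            kept.append((length, locus, start, end))
    return sorted(kept)


# Hypothetical map records (locus names are invented for the example).
example = [("HUMXYZ1", 101, 400), ("HUMXYZ1", 900, 930), ("HUMABC2", 10, 59)]
filtered = filter_by_length(example)
for row in filtered:
    print(*row)
```

A fragment of exactly 50 nucleotides is kept here, since only fragments shorter than 50 were excluded.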
The program VSUB2 [written by Paul Klonowski, GIRI] was used to make a working sublibrary of sequence files (with all their accompanying annotations) for all the loci listed in the map file. This first required making a source file from the map file by deleting all fields except the first (the locus name) and removing all redundancies with the UNIX cut and uniq utilities. The needed sequence files were extracted from GenBank release 112.0 (June 15, 1999). This Alu sublibrary was needed in order to generate alignments between the Alu pairs. VSUB2 and all other binary program files used in this project were previously compiled for use on the Sun SPARCstation running SunOS 5.5 at GIRI. All manipulations took place remotely at GIRI through telnet (National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign).
Subcategorization of Data According to Their Relative Orientation
The revised map file was used as input to derive a list of loci and their corresponding coordinates for each set of two adjacent Alu sequence elements (an Alu pair). The program PFOLLOWS3 [ibid. Klonowski], which locates adjacent pairs, handled this task by taking the input file, the minimum acceptable distance, and the maximum acceptable distance as parameters. The separation distances selected ranged between 0 and 650 nucleotides. These initial distances represented the 'c' distance, or the distance between one Alu consensus sequence and the next. The PFOLLOWS3 output contained four lines, one for each of the four possible orientation permutations for a pair, each several thousand characters long. Direct repeats could be in the complementary-complementary (CC) orientation (<-- <--) or the direct-direct (DD) orientation (--> -->). Indirect (inverted) repeats could be in the complementary-direct (CD) orientation, in which the 5' ends, as related to the direction of transcription of active Alus, point towards each other (<-- -->), or in the direct-complementary (DC) orientation (--> <--). All pairs in the CC orientation are listed on the first line of the PFOLLOWS3 output, followed by all the pairs in the CD orientation, the DC orientation, and the DD orientation. Each line begins with the orientation, gives the number of pairs of that orientation, and lists all of the pairs of that orientation by providing the locus identification number, immediately followed by parentheses containing the start and end sequence coordinates separated by a comma.
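The pairing logic described above can be sketched in Python as follows. This is only an illustration of the idea, not the PFOLLOWS3 program: the record layout, the function name find_adjacent_pairs, and the definition of the separation as the number of nucleotides strictly between the two fragments are all assumptions.

```python
from collections import defaultdict


def find_adjacent_pairs(records, min_dist=0, max_dist=650):
    """Group adjacent fragments on the same locus by orientation pair.

    records: iterable of (locus, start, end, orientation), orientation 'C' or 'D'.
    The separation is assumed to be next_start - previous_end - 1.
    """
    by_locus = defaultdict(list)
    for locus, start, end, orient in records:
        by_locus[locus].append((start, end, orient))

    pairs = {"CC": [], "CD": [], "DC": [], "DD": []}
    for locus, frags in by_locus.items():
        frags.sort()  # order fragments by start position within the locus
        for (s1, e1, o1), (s2, e2, o2) in zip(frags, frags[1:]):
            gap = s2 - e1 - 1
            if min_dist <= gap <= max_dist:
                pairs[o1 + o2].append((locus, (s1, e1), (s2, e2)))
    return pairs


# Hypothetical records; locus names and coordinates are invented.
records = [
    ("LOC1", 100, 380, "D"),
    ("LOC1", 500, 790, "C"),
    ("LOC1", 2000, 2290, "C"),
    ("LOC2", 10, 300, "C"),
    ("LOC2", 301, 590, "C"),
]
pairs = find_adjacent_pairs(records)
for orientation in ("CC", "CD", "DC", "DD"):
    print(orientation, len(pairs[orientation]), pairs[orientation])
```

Overlapping fragments produce a negative separation under this definition and are therefore skipped; whether PFOLLOWS3 treats them the same way is not stated in the text.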
Four simple programs written in the Practical Extraction and Report Language (Perl [Wall, 1991]) were run by an executable script to extract the coordinates for each of the four types of pairs (CC, CD, DC, DD) from the PFOLLOWS3 output file. There were a total of 70,324 Alu pairs that were separated by less than 650 nucleotides (23,318 DD; 22,769 CC; 12,188 CD; 12,049 DC). The four subfiles were reformatted to place the pair coordinates side-by-side in list format for subsequent reference and analysis. Multiple spaces were first reduced to single spaces and were eventually replaced by the newline character (\n) in batch through executable shell scripts using the UNIX translate (tr) utility [Rosen, 1990]. These coordinate list files were then used to extract the sequences from the GenBank-derived sublibrary through the use of the program VEXT [ibid. Klonowski].
Extraction of Alu Sequence Fragments
Sequences were extracted for both Alu elements in each pair. For eventual use of these data as an on-line reference, the entire region from the first nucleotide of the first Alu to the last nucleotide of the adjacent Alu in the pair was also extracted as a separate file. These three extractions were carried out as a multi-step process that, among other operations, calls the GIRI programs VFLOP and VEXT. These steps are automated by a UNIX shell script used with the executable file Extract_pairs.
Because multiple Alu pairs were frequently found at the same locus, it was necessary to be able to distinguish easily between pairs rather than relying exclusively on the unique coordinates for identification. Therefore, before the pairs were aligned, the locus names were made unique by appending a suffix to the original name through a function of the program PPLAN (PLAN, Version 2.0 [ibid. Klonowski]). This task was facilitated by the shell script rename and the batch file Make_uniq.
Sequence Alignments Between the First and Second Alus in Each Pair
For the indirect repeats (the CD and DC Alu pairs), the complementary sequence of the second Alu in each pair was generated in batch using the program PCOMP1 [Klonowski, 1997] before alignment. The Alu pairs that were direct repeats (CC and DD) were aligned with each other; for the indirect repeats, the first Alu in each pair was aligned with the complement of the second Alu element using the PFLANK3 program [Klonowski, 1997]. The PFLANK3 program was designed especially for finding flanking repeats but had the undesirable effect of removing the original coordinates from the map and renumbering the sequences. As with most alignment algorithms, it produced no flanking gaps; therefore, any gaps that existed were within the alignment. The output included the renumbered start and finish sites for both the top and bottom sequences in the alignment, and the actual aligned sequences, with an asterisk denoting a match, a colon denoting an aligned pair of either purines or pyrimidines, and a single dot denoting a purine paired with a pyrimidine. The output also included, among other things, the numbers of matches, mismatches, and internal gaps, and a similarity score.
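The preparation of the second Alu before alignment can be sketched in Python as below. This is not the PCOMP1 program; it assumes that "complementary sequence" means the reverse complement of the strand, and the function names are hypothetical.

```python
# Translation table for DNA bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def prepare_second_alu(pair_orientation, second_seq):
    """Complement the second Alu only for inverted pairs (CD or DC);
    direct pairs (CC or DD) are aligned as extracted."""
    if pair_orientation in ("CD", "DC"):
        return reverse_complement(second_seq)
    return second_seq


print(prepare_second_alu("DC", "AACG"))
print(prepare_second_alu("DD", "AACG"))
```

After this step both sequences of an inverted pair read in the same direction, so a single alignment routine can be applied to all four orientation classes.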
Retrieval of Coordinates
Once the sequences were aligned, the actual aligned coordinate numbers (as opposed to the renumbered coordinates) were regenerated using the short program PRENUM02 [Klonowski, 1998], which takes the following parameters: the PFLANK3 alignment output file that contains the renumbered sequence alignment, the sequence file that contains the original data that contributed to the top sequence of the alignment, the sequence file that contributed to the bottom sequence of the alignment, and an output filename. For the inverted repeats, the complemented files, whose coordinates were inverted, were used as input, and the coordinates needed to be reverted at a later step. Also, the complementary strands carried the appended suffix @2, which needed to be stripped prior to alignment. This was done using the GNU Emacs replace-string command. Once the changes were made, the retrieval of coordinates was run in batch by execution of the command Get_real_align_coords. These files were renamed with the file extension ".align". Summary data were extracted through execution of the UNIX grep [Rosen, 1990] command twice, first to get the recovered alignment coordinates and then to retrieve the alignment statistics, which were contained on different lines in the alignment output. These files were pasted so that all the pertinent information was on one line.
The next step was to run the Perl script Reformat_grep from a batch file. In addition to simply reformatting the data, Reformat_grep verified that the locus names matched after the columns were pasted, calculated the alignment length ('a'), and recalculated the percentage identity. The program calculated 'a' by subtracting the beginning of the aligned sequence from the end of the aligned sequence and adding one, for both the first and second sequences used; whichever value was larger was chosen as the length. This algorithm, therefore, considers gaps in calculating the length of the alignment. It then determined the percentage identity by dividing the number of matches by the alignment length and multiplying by 100. The script also extracted the similarity score from the PFLANK3 output. The total length of the gaps was calculated by adding the number of mismatches and matches and subtracting that sum from 'a'. The 'b+c' value (internal flanks plus spacer distance) was generated by subtracting the end of the prenumbered alignment coordinate of the first Alu in the pair from the beginning of the prenumbered alignment coordinate of the second Alu in the pair. Finally, the number of occurrences of gaps per alignment was extracted from the alignment output.
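The quantities described above can be worked through in a short Python sketch. It follows the arithmetic given in the text but is not the Reformat_grep script; the function name, argument names, and example numbers are assumptions chosen for illustration.

```python
def alignment_summary(first_start, first_end, second_start, second_end,
                      matches, mismatches):
    """Compute 'a', percent identity, total gap length, and 'b+c'
    from the recovered alignment coordinates of the two Alus in a pair."""
    first_len = first_end - first_start + 1
    second_len = second_end - second_start + 1
    a = max(first_len, second_len)            # the longer aligned span is used
    percent_identity = 100.0 * matches / a
    gap_total = a - (matches + mismatches)    # positions that are neither
    b_plus_c = second_start - first_end       # internal flanks plus spacer
    return {
        "a": a,
        "percent_identity": percent_identity,
        "gap_total": gap_total,
        "b_plus_c": b_plus_c,
    }


# Invented example: first Alu aligned over 100..380, second over 500..789.
summary = alignment_summary(100, 380, 500, 789, matches=240, mismatches=35)
print(summary)
```

In this example the two aligned spans are 281 and 290 nucleotides long, so 'a' is 290, the gaps total 290 - 275 = 15, and 'b+c' is 500 - 380 = 120.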
To recover the sequence description headings and the original unaligned sequence fragment coordinates from the sequence files that were used in generating the data, we wrote two small Perl programs, called coordinates and get_descrip. Both programs could be run by a single shell script, run in batch, that piped the extracted coordinates into the program VFLOP for rearrangement. Once the files containing the coordinates for the first and second Alus in each pair and the descriptions were generated, they were pasted side-by-side, along with the 'grepped' alignment statistics, using the UNIX paste command. This resulted in a data file in which all the above information formed blank-separated fields in each record.
Retrieval of Alu Family Names for the Fragments
To add the family and subfamily names to each record in the data file, it was necessary to retrieve the information from the map file. The map file and the data files are relational databases in that the unappended locus name and the beginning and ending coordinates of the Alu fragment are fields common to both. It was, therefore, possible to extract the family names from the map file and append them to each record. This was done by a Perl program called find_family.
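The lookup on the shared fields can be sketched in Python as a keyed join. This is not the find_family program; the record layouts, the "unknown" placeholder, and the example loci are assumptions, and the data records are assumed to already carry the unappended locus name.

```python
def build_family_index(map_records):
    """Index the map file by the fields shared with the data file:
    (locus, start, end) -> family name."""
    return {(locus, start, end): family
            for locus, start, end, family in map_records}


def add_family(data_records, family_index):
    """Append the family name to each data record; records without a
    matching map entry are marked 'unknown' (an illustrative choice)."""
    annotated = []
    for record in data_records:
        key = (record["locus"], record["start"], record["end"])
        annotated.append({**record, "family": family_index.get(key, "unknown")})
    return annotated


# Invented example entries.
map_records = [("LOC1", 100, 380, "AluSx"), ("LOC1", 500, 790, "AluJo")]
data_records = [
    {"locus": "LOC1", "start": 500, "end": 790},
    {"locus": "LOC9", "start": 1, "end": 300},
]
annotated = add_family(data_records, build_family_index(map_records))
for record in annotated:
    print(record)
```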
Obtaining the 'a' Length and Subgrouping the Data into Bins According to Percent Identity Between Paired Alus
A Perl program, Bins, subdivided the files according to the range in which the 'a' length fell (it was called by running the batch file Sort_by_length). In the process, the Bins program verified that the locus names in the pasted coordinate data matched those in the alignment data. As an internal check, the Bins program also recalculated the 'c' size by subtracting the end of the first original Alu fragment coordinate from the end of the second original Alu fragment coordinate. These data agreed with the PFOLLOWS3 output discussed above. The 'b' value was determined by subtracting 'c' from 'b+c', giving the combined flanking Alu sequences that lie between the two Alus but are not included in the alignment. Another Perl program, called BinsBC, was used to further subdivide the summary output from the alignment into different groupings dependent on the 'b+c' distance separating the two Alus in the pairs. This program was also run in batch. For the last subcategorization, the Bins3 program was developed and used to group the data into eight percentile bins based on the percent identity between the two Alu sequences in the pair. The percent_batch executable batch files were used to run the Bins program on all of the pasted files that contained all the alignment data.
Finally, to simplify the presentation of the data as four orientation-based tables, the task was automated as follows: the make_html batch files invoke html_processor, a shell script that calls the Data_to_html file and shortens the description file by removing the species name. Data_to_html converts the text-based information to HTML, and the tables are automatically constructed using the program auto_tables.
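The derivation of 'b' and the grouping by percent identity can be sketched in Python. This is not the Bins, BinsBC, or Bins3 code. The text does not give the bin boundaries, so the seven edges below, which produce eight bins, are purely hypothetical, as are the function names.

```python
from bisect import bisect_right

# Hypothetical lower edges for eight percent-identity bins:
# <65, 65-70, 70-75, 75-80, 80-85, 85-90, 90-95, >=95.
HYPOTHETICAL_EDGES = [65, 70, 75, 80, 85, 90, 95]


def flank_b(b_plus_c, c):
    """'b' is what remains of 'b+c' after removing the spacer 'c'."""
    return b_plus_c - c


def identity_bin(percent_identity, edges=HYPOTHETICAL_EDGES):
    """Return the 0-based index of the bin containing percent_identity."""
    return bisect_right(edges, percent_identity)


def group_by_identity(records, edges=HYPOTHETICAL_EDGES):
    """Distribute (name, percent_identity) records over len(edges) + 1 bins."""
    bins = [[] for _ in range(len(edges) + 1)]
    for name, percent_identity in records:
        bins[identity_bin(percent_identity, edges)].append(name)
    return bins


# Invented pair names and identities.
grouped = group_by_identity([("pair1", 82.76), ("pair2", 95.0), ("pair3", 60.0)])
print(flank_b(120, 100))
print(grouped)
```

The same bisect-based lookup could serve for the 'a' length and 'b+c' groupings by supplying a different list of edges.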