EagleView reads multiple data files, including the standard ACE genome assembly file and optional READS, EGL, and MAP files. The optional files contain additional information about a genome assembly that can be visualized in EagleView. EagleView was designed with users' needs in mind. All of these files are in simple text formats, which makes it easy to create customized data files and to visualize user-defined features in EagleView. The detailed text format for each file type is described in the following subsections.
When EagleView opens an ACE file, it automatically loads the associated optional files containing base quality/flow signal information and genome feature mapping information, provided these optional files exist in the same directory as the ACE file. The optional files are defined as follows:
A READS file contains base quality information, with or without flow signal information, for all reads in the corresponding ACE file. An EGL file is a Contig location index file for a READS file. EagleView loads quality and flow signal information only when both files exist.
A MAP file contains location mapping information for genome features (e.g., exons, introns, and cis-elements). There can be multiple mapping files if the ACE file has multiple Contigs, one mapping file per Contig.
EagleView uses a set of filename extensions to distinguish the different types of input files, as given in the following table.
File type | Extension name |
genome assembly | ACE |
base quality/flow information | READS |
file location index of READS file | EGL |
genome feature map | MAP |
A filename extension must be entirely in lower case or entirely in upper case, not a mix of both. EagleView uses the base filename to associate an assembly ACE file with files of the other types: files that share the base filename of an ACE file are automatically associated with it. For example, if an ACE file is named {prefix}.ace, the other files associated with it should be named {prefix}.reads, {prefix}.egl, and {prefix}.{contigName}.map, where {contigName} is the name of a contig in the ACE file. Note that a MAP file is associated with only one contig, so an ACE file with multiple contigs can have multiple MAP files, one per contig. When loading a genome assembly ACE file, EagleView automatically loads the other files associated with it.
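The naming rule above can be sketched in Python. This is only an illustration, not EagleView's own code: the helper name companion_candidates is hypothetical, and because the text does not say whether companion files must use the same case as the ACE extension, the sketch simply lists both the all-lower-case and all-upper-case candidates.

```python
from pathlib import Path

def companion_candidates(ace_path, contig_names):
    """List the file names that could accompany an ACE file (hypothetical helper)."""
    ace = Path(ace_path)
    if ace.suffix not in (".ace", ".ACE"):
        raise ValueError("extension must be all lower case or all upper case")
    prefix = ace.name[: -len(ace.suffix)]   # base filename without the extension
    folder = ace.parent                     # companions live in the same directory

    def both_cases(stem, ext):
        # Assumption: either case is acceptable for a companion file.
        return [folder / f"{stem}.{ext.lower()}", folder / f"{stem}.{ext.upper()}"]

    return {
        "reads": both_cases(prefix, "reads"),
        "egl": both_cases(prefix, "egl"),
        # One MAP file per contig: {prefix}.{contigName}.map
        "map": {name: both_cases(f"{prefix}.{name}", "map") for name in contig_names},
    }

candidates = companion_candidates("data/run1.ace", ["ctg1", "ctg2"])
```

A caller could then keep only the candidates for which Path.exists() is true.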
The standard ACE file format for genome assembly is fully described in Philip Green's Consed documentation at http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt.
EagleView extracts the information defined in seven tags (AS, CO, BQ, AF, RD, QA, and DS) and skips all other information. Brief descriptions of the tags and a simple example of an ACE file follow.
AS {number of contigs} {total number of reads in ace file}
CO {contig name} {# of bases} {# of reads in contig} {# of base segments in contig} {U or C}
   U or C stands for an uncomplemented or complemented strand.
BQ
   This starts the list of base qualities for the unpadded consensus bases.
AF {read name} {C or U} {padded start consensus position}
BS {padded start consensus position} {padded end consensus position} {read name}
RD {read name} {# of padded bases} {# of whole read info items} {# of read tags}
QA {qual clipping start} {qual clipping end} {align clipping start} {align clipping end}
DS CHROMAT_FILE: {name of chromat file} PHD_FILE: {name of phd file} TIME: {date/time of the phd file} CHEM: {prim, term, unknown, etc} DYE: {usually ET, big, etc} TEMPLATE: {template name} DIRECTION: {fwd or rev}
A sample ACE File:
AS 1 3

CO ctgName 1475 8 156 U
gatttcgggccgtggggttccttgagcactcccttagttccttcccagga...

BQ
23 26 24 25 25 17 17 13 14 19 21 22 22 17 17 11 ...

AF ID123526t U 205
AF ID123572c C 1
AF ID123766c C 310
BS 1 515 ID123572c
BS 516 900 ID123526t
BS 901 1475 ID123766c

RD ID123572c 563 0 0
Gatttcgggccgtggggttccttgagcactcccttagttccttcccagga ...

QA 19 349 19 424
DS CHROMAT_FILE: ID123572c PHD_FILE: ID123572c.phd.1 TIME: Fri Jan 12 10:40:10 2007

RD ID123526t 687 0 0
ccgtcctgagtggAGggcatggggcttggctggGCAAAGAGCTAACATAC ...

QA 12 353 9 572
DS CHROMAT_FILE: ID123526t PHD_FILE: ID123526t.phd.1 TIME: Fri Jan 12 10:40:10 2007

RD ID123766c 517 0 0
TTtattaccggcgcggggttCcgTCGGAAAGGGAAATCAGCAAGAAGCTG ...

QA 20 415 26 514
DS CHROMAT_FILE: ID123766c PHD_FILE: ID123766c.phd.1 TIME: Fri Jan 12 10:40:10 2007
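To make the tag layout concrete, here is a minimal Python sketch that scans an ACE file for the AS, CO, AF, RD, and QA header lines and ignores everything else. It is not EagleView's parser; the function name scan_ace and the dictionary layout are assumptions made for illustration, and the sketch assumes each tag starts its own line, as in the standard ACE format.

```python
def scan_ace(lines):
    """Collect contig and read metadata from ACE tag lines (illustrative only)."""
    summary = {"num_contigs": None, "num_reads": None, "contigs": []}
    contig = None   # contig currently being read
    read = None     # read named by the most recent RD line
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "AS" and len(fields) == 3:
            summary["num_contigs"] = int(fields[1])
            summary["num_reads"] = int(fields[2])
        elif tag == "CO" and len(fields) == 6:
            contig = {
                "name": fields[1],
                "bases": int(fields[2]),
                "reads": int(fields[3]),
                "segments": int(fields[4]),
                "complemented": fields[5] == "C",
                "placements": {},   # read name -> (C or U, padded start)
                "clipping": {},     # read name -> the four QA values
            }
            summary["contigs"].append(contig)
        elif tag == "AF" and contig is not None and len(fields) == 4:
            contig["placements"][fields[1]] = (fields[2], int(fields[3]))
        elif tag == "RD" and len(fields) == 5:
            read = fields[1]
        elif tag == "QA" and contig is not None and read is not None and len(fields) == 5:
            contig["clipping"][read] = tuple(int(x) for x in fields[1:5])
        # Sequence lines, BQ values, BS and DS lines are skipped in this sketch.
    return summary

sample = [
    "AS 1 3",
    "CO ctgName 1475 8 156 U",
    "gatttcgggccgtggggttcc",
    "BQ",
    "23 26 24 25",
    "AF ID123572c C 1",
    "RD ID123572c 563 0 0",
    "Gatttcgggccgtggggttcc",
    "QA 19 349 19 424",
]
ace_summary = scan_ace(sample)
```

The field-count checks are a simple guard so that sequence or quality lines are not mistaken for tag lines.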
The READS file stores quality/flow information for the sequence reads used to assemble a genome. The information is organized into Sections, each having one or more Records. A Section in a READS file corresponds to a Contig in the associated ACE file, and a Record corresponds to a read in that Contig.
A Section consists of a section header and one or more Records. A Section starts with a section header in the following format:
>contig_name num_of_read num_line_per_record
The tab delimiter is used to separate the three fields, whose definitions are given by:
contig_name - the name of contig
num_of_read - the total number of reads in the contig
num_line_per_record - the number of lines used to store read quality/flow information in a Record
Note: since the read ID occupies one line in a Record, the total number of lines for a Record is num_line_per_record+1.
The first line of a Record always contains the read ID and Read_Type. It is followed by a line containing the base qualities of the read, which can in turn be followed by additional lines for other types of information (e.g., flowgram signals of the read). Any information other than quality must be given as a pair of lines, with indexes in the first line and values in the second line. The total number of lines for the Record should be 1+num_line_per_record, as defined in the Section header. Again, the tab delimiter separates the individual fields/values in each line. For example:
READID Read_Type
Quality_Line
Flow_Index_Line
Flow_Value_Line
In the example, the last two lines, Flow_Index_Line and Flow_Value_Line, are optional.
EagleView currently supports two read types: 454 reads, represented by flowgram signals, and Illumina/Solexa reads, represented by four-color signals. EagleView uses the values 1 and 2 to indicate 454 reads (type 1 reads) and Illumina reads (type 2 reads), respectively. EagleView also supports mixed-type reads in one READS file for co-assembly of sequencing reads from different technologies.
A sample READS file with both 454 and Illumina reads:
>ctg1 2 3
DCYMEOY01A2GZ0 1
36 24 26 27 26 26 22 27 23 ...
8 8 10 12 13 15 18 21 24 ...
97 14 107 12 12 91 11 99 209 ...
DCYMEOY01A2P6P 1
36 24 26 26 26 26 23 35 23 ...
9 9 11 12 13 15 18 19 19 ...
98 8 105 11 13 99 14 100 8 ...
>ctg2 6 3
DCYMEOY01AFSP6 1
44 31 17 4 40 29 11 21 46 ...
9 9 9 9 12 12 12 13 16 ...
91 6 106 6 10 99 9 100 7 ...
DCYMEOY01AGTAS 1
36 24 35 23 25 26 24 45 32 ...
9 9 11 11 14 15 17 20 20 ...
89 7 108 7 9 102 14 98 12 ...
DCYMEOY01AH532 1
27 41 30 11 33 21 27 26 26 ...
10 12 12 12 15 15 18 19 20 ...
95 6 109 7 9 99 10 101 14 ...
B_RUN1_7_168_884_609 2
30 30 30 30 30 30 30 30 30 ...
0 6 11 14 18 20 27 28 34 ...
7992.9 658.2 473.6 507.7 368.8 327.5 7055.2 1507.2 887.0 ...
B_RUN1_7_168_506_389 2
30 30 30 30 30 30 30 30 30 ...
3 7 10 15 16 20 24 28 32 ...
445.2 183.9 132.3 8521.9 769.3 1116.8 774.4 14748.8 663.1 ...
B_RUN1_1_135_384_892 2
30 30 30 30 30 30 30 30 30 ...
2 7 11 12 18 23 26 31 35 ...
537.0 282.5 9355.4 898.1 355.2 3014.5 409.3 5507.6 646.1 ...
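The Section/Record layout can be read with a short loop. The following Python sketch is an illustration under stated assumptions, not EagleView's implementation: parse_reads and its output layout are hypothetical, fields are assumed to be tab-separated as described above, and the "..." truncation markers in the sample would have to be replaced by real values before such a file could be parsed.

```python
def parse_reads(lines):
    """Parse READS text into {contig_name: [record, ...]} (illustrative sketch)."""
    sections = {}
    it = iter(lines)
    for line in it:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if not line.startswith(">"):
            raise ValueError(f"expected a section header, got: {line!r}")
        contig_name, num_of_read, num_line_per_record = line[1:].split("\t")
        records = []
        for _ in range(int(num_of_read)):
            # First line of a Record: read ID and Read_Type (1 = 454, 2 = Illumina).
            read_id, read_type = next(it).rstrip("\n").split("\t")
            body = [next(it).rstrip("\n").split("\t")
                    for _ in range(int(num_line_per_record))]
            record = {
                "id": read_id,
                "type": int(read_type),
                "quality": [int(q) for q in body[0]],
                "extras": [],   # list of (indexes, values) pairs
            }
            # Lines after the quality line come in index/value pairs.
            for index_line, value_line in zip(body[1::2], body[2::2]):
                record["extras"].append(
                    ([int(i) for i in index_line], [float(v) for v in value_line])
                )
            records.append(record)
        sections[contig_name] = records
    return sections

reads_text = (
    ">ctg1\t2\t3\n"
    "readA\t1\n36\t24\t26\n8\t10\t12\n97\t14\t107\n"
    "readB\t2\n30\t30\t30\n0\t6\t11\n7992.9\t658.2\t473.6\n"
)
reads_sections = parse_reads(reads_text.splitlines())
```

Each Record consumes exactly 1+num_line_per_record lines, so a truncated file surfaces as an error from next() rather than as silently misaligned data.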
An EGL file stores the file index locations of the Section headers in the corresponding READS file. Each line in an EGL file has the following format:
contig_name location_section num_reads num_line
Here location_section is the file index location in the READS file pointing to the end of the section header that starts with contig_name; num_reads is the total number of reads in the section; and num_line, equivalent to num_line_per_record defined above, is the number of lines storing quality/flow information per record.
A sample EGL file:
ctg1 9 247 3
ctg2 301765 107 3
ctg3 439634 360 3
ctg4 904028 135 3
ctg5 107267 225 3
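The purpose of the index is to jump straight to one contig's Records without scanning the whole READS file. The Python sketch below shows the idea. It rests on an assumption the text does not spell out: that location_section is a byte offset to the first byte after the header line, including its newline. The names parse_egl and read_section are hypothetical, and the in-memory data is made up for the demonstration.

```python
import io

def parse_egl(lines):
    """Map contig_name -> (location_section, num_reads, num_line)."""
    index = {}
    for line in lines:
        if not line.strip():
            continue
        name, location, num_reads, num_line = line.rstrip("\n").split("\t")
        index[name] = (int(location), int(num_reads), int(num_line))
    return index

def read_section(reads_file, entry):
    """Return the raw Record lines of one Section from a binary READS stream."""
    location, num_reads, num_line = entry
    reads_file.seek(location)              # assumed: byte offset just past the header
    total = num_reads * (num_line + 1)     # each Record has num_line + 1 lines
    return [reads_file.readline().decode("ascii").rstrip("\n") for _ in range(total)]

# Made-up two-section READS content, one read and one quality line per section.
reads_bytes = b">ctg1\t1\t1\nr1\t1\n30\t30\n>ctg2\t1\t1\nr2\t2\n20\t20\n"
header2 = b">ctg2\t1\t1\n"
offset2 = reads_bytes.index(header2) + len(header2)
egl_index = parse_egl([f"ctg2\t{offset2}\t1\t1"])
ctg2_lines = read_section(io.BytesIO(reads_bytes), egl_index["ctg2"])
```

If the offset in a real EGL file instead points just before the header's newline, the first readline() would return an empty line and would need to be skipped.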
A MAP file usually contains two types of information: feature classes and feature locations. Feature classes tell EagleView how to present the different classes of features, and feature locations tell where these features are in a genome. Each feature class or feature location is defined by a single line.
A feature class line always starts with '>', which is immediately followed by a class name and mapping instructions, all separated by tabs. Each instruction consists of a key and value pair separated only by '='. Four keys are currently defined, each represented by two upper-case letters. Each feature class should have at least one instruction and no more than four instructions. The four keys are:
BG - background color
FG - foreground color
SY - class name or class symbol represented by one word or
letter/symbol. Class name/symbol will be displayed in the viewer
PT - position type of feature locations with value either 0 or 1. The
type 0 stands for unpadded position and 1 for padded position
counting gaps/pads. The default value is 0.
Note: a color is defined in RGB format in the form #rrggbb, where each r, g, or b is a hexadecimal digit between '0' and 'F'. EagleView also recognizes color names for some commonly used colors such as white, blue, black, cyan, green, navy, red, and yellow.
>class_name BG=#FF00FF FG=blue SY=anyName
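A feature class line such as the one above could be parsed as follows. This Python sketch is illustrative only: parse_feature_class is a hypothetical helper, color names are passed through unchecked because the list of recognized names is not exhaustive, and accepting lower-case hex digits is an assumption.

```python
import re

VALID_KEYS = {"BG", "FG", "SY", "PT"}
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")   # assumption: lower-case digits tolerated

def parse_feature_class(line):
    """Return (class_name, settings) for a '>' feature class line."""
    if not line.startswith(">"):
        raise ValueError("a feature class line must start with '>'")
    class_name, *instructions = line[1:].rstrip("\n").split("\t")
    if not 1 <= len(instructions) <= 4:
        raise ValueError("expected between one and four instructions")
    settings = {"PT": 0}                      # documented default: unpadded positions
    for instruction in instructions:
        key, _, value = instruction.partition("=")
        if key not in VALID_KEYS:
            raise ValueError(f"unknown key: {key!r}")
        if key in ("BG", "FG") and value.startswith("#") and not HEX_COLOR.fullmatch(value):
            raise ValueError(f"bad #rrggbb color: {value!r}")
        if key == "PT":
            if value not in ("0", "1"):
                raise ValueError("PT must be 0 or 1")
            value = int(value)
        settings[key] = value
    return class_name, settings

class_name, class_settings = parse_feature_class(
    ">class_name\tBG=#FF00FF\tFG=blue\tSY=anyName"
)
```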
A feature location consists of at least a feature class name and a start position. If the feature is longer than 1 base, the stop position should also be provided. Fields are separated by the tab delimiter. The three valid feature entry formats are:
class_name begin_pos end_pos feature_name
class_name begin_pos end_pos
class_name position
A sample MAP file:
>SNP BG=white FG=red SY=+
>E BG=cyan SY=exon
>I BG=yellow SY=intron
>G BG=gray SY=gene PT=0
>R FG=#00FF00 SY=$ PT=0
SNP 3
SNP 76
E 15 25 the exon name
I 100 125
G 72 141 the gene name
R 23 47
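The three feature entry formats can be handled by one small routine. The Python sketch below is an illustration, not EagleView's reader: parse_feature_locations is a hypothetical name, class definition lines are simply skipped here, and treating a single-position entry as a feature whose end equals its start is a choice made for this example.

```python
def parse_feature_locations(lines):
    """Return (class_name, begin, end, feature_name) tuples from MAP lines."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(">"):
            continue                          # blank line or feature class definition
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"a feature needs a class name and a position: {line!r}")
        class_name = fields[0]
        begin = int(fields[1])
        # A one-base feature gives only its position, so end defaults to begin.
        end = int(fields[2]) if len(fields) > 2 else begin
        feature_name = fields[3] if len(fields) > 3 else None
        features.append((class_name, begin, end, feature_name))
    return features

map_lines = [
    ">SNP\tBG=white\tFG=red\tSY=+",
    ">E\tBG=cyan\tSY=exon",
    "SNP\t3",
    "E\t15\t25\tthe exon name",
    "I\t100\t125",
]
map_features = parse_feature_locations(map_lines)
```

Whether the positions are padded or unpadded depends on the PT value of the feature's class, so a fuller reader would pair each tuple with its class settings.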