Data Files and Formats

EagleView reads several data files: the standard ACE genome assembly file and the optional READS, EGL, and MAP files. The optional files carry additional information about a genome assembly that EagleView can visualize. EagleView was designed with users' needs in mind: all of these files use simple text formats, which makes it easy to create customized data files and to visualize user-defined features in EagleView. The text format of each file type is described in the following subsections.

When EagleView opens an ACE file, it automatically loads the associated optional files, which contain base quality/flow signal information and genome feature mapping information, provided that these files exist in the same directory as the ACE file and follow the filename convention described below.

Filename convention

EagleView uses filename extensions to distinguish the different types of input files, as listed in the following table.

File type                            Extension name
genome assembly                      ACE
base quality/flow information        READS
file location index of READS file    EGL
genome feature map                   MAP

A filename extension must be entirely in lower case or entirely in upper case, not a mixture of both. EagleView uses the base filename to associate an assembly ACE file with files of the other types: files that share the base filename of an ACE file are automatically associated with it. For example, if an ACE file is named {prefix}.ace, the associated files should be named {prefix}.reads, {prefix}.egl, and {prefix}.{contigName}.map, where {contigName} is the name of a contig in the ACE file. Note that a MAP file is associated with only one contig, so an ACE file with multiple contigs can have multiple MAP files, one per contig. When EagleView loads an ACE file, it loads these associated files automatically.
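The naming rule can be sketched in a few lines of Python. This is only an illustration of the convention described above, not EagleView code: the function name associated_files is hypothetical, and the choice to give the companion files the same extension case as the ACE file is an assumption (the text only requires that each extension is not mixed case).

```python
from pathlib import Path

def associated_files(ace_path, contig_names):
    """Return the companion file paths implied by the naming convention (sketch)."""
    ace = Path(ace_path)
    ext = ace.suffix[1:]                      # "ace" or "ACE"
    if ext not in ("ace", "ACE"):
        raise ValueError("expected an extension of 'ace' or 'ACE', not mixed case")
    # Assumption: companion files use the same case as the ACE extension.
    fix_case = str.upper if ext.isupper() else str.lower
    prefix = ace.with_suffix("")              # directory plus {prefix}
    return {
        "reads": Path(f"{prefix}.{fix_case('reads')}"),
        "egl": Path(f"{prefix}.{fix_case('egl')}"),
        # One MAP file per contig: {prefix}.{contigName}.map
        "maps": {name: Path(f"{prefix}.{name}.{fix_case('map')}")
                 for name in contig_names},
    }
```

For an ACE file named run1.ace with contigs ctg1 and ctg2, this yields run1.reads, run1.egl, run1.ctg1.map, and run1.ctg2.map.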

ACE file format

The standard ACE file format for genome assembly is fully described in Philip Green's Consed documentation at http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt.

EagleView extracts the information defined by seven tags (AS, CO, BQ, AF, RD, QA, and DS) and skips all other information. Brief descriptions of the tags and a simple example of an ACE file follow.

ACE tag definitions

AS {number of contigs} {total number of reads in ace file}
CO {contig name} {# of bases} {# of reads in contig} {# of base segments in contig} {U or C}
U stands for an uncomplemented strand and C for a complemented strand.
BQ
This starts the list of base qualities for the unpadded consensus bases.
AF {read name} {C or U} {padded start consensus position}
BS {padded start consensus position} {padded end consensus position} {read name}
RD {read name} {# of padded bases} {# of whole read info items} {# of read tags}
QA {qual clipping start} {qual clipping end} {align clipping start} {align clipping end}
DS CHROMAT_FILE: {name of chromat file} PHD_FILE: {name of phd file} TIME: {date/time of the phd file} CHEM: {prim, term, unknown, etc} DYE: {usually ET, big, etc} TEMPLATE: {template name} DIRECTION: {fwd or rev}

A sample ACE File:

AS 1 3
CO ctgName 1475 3 3 U
gatttcgggccgtggggttccttgagcactcccttagttccttcccagga...

BQ
23 26 24 25 25 17 17 13 14 19 21 22 22 17 17 11 ...

AF ID123526t U 205
AF ID123572c C 1
AF ID123766c C 310

BS 1 515 ID123572c
BS 516 900 ID123526t
BS 901 1475 ID123766c

RD ID123572c 563 0 0
Gatttcgggccgtggggttccttgagcactcccttagttccttcccagga
...

QA 19 349 19 424
DS CHROMAT_FILE: ID123572c PHD_FILE: ID123572c.phd.1 TIME: Fri Jan 12 10:40:10 2007

RD ID123526t 687 0 0
ccgtcctgagtggAGggcatggggcttggctggGCAAAGAGCTAACATAC
...

QA 12 353 9 572
DS CHROMAT_FILE: ID123526t PHD_FILE: ID123526t.phd.1 TIME: Fri Jan 12 10:40:10 2007

RD ID123766c 517 0 0
TTtattaccggcgcggggttCcgTCGGAAAGGGAAATCAGCAAGAAGCTG
...

QA 20 415 26 514
DS CHROMAT_FILE: ID123766c PHD_FILE: ID123766c.phd.1 TIME: Fri Jan 12 10:40:10 2007
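The tag lines in the sample can be picked out with a simple line scanner. The following Python sketch is not EagleView's parser; scan_ace and the dictionary layout are hypothetical names chosen for illustration. It reads only the AS, CO, and AF lines, uses the field counts from the tag definitions to tell tag lines apart from sequence lines, and ignores everything else.

```python
def scan_ace(lines):
    """Collect assembly, contig, and read placement info from ACE tag lines (sketch)."""
    summary = {"contigs": 0, "reads": 0}
    contigs = {}
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "AS" and len(fields) == 3:
            # AS {number of contigs} {total number of reads in ace file}
            summary["contigs"] = int(fields[1])
            summary["reads"] = int(fields[2])
        elif tag == "CO" and len(fields) == 6:
            # CO {contig name} {# bases} {# reads} {# base segments} {U or C}
            current = {"bases": int(fields[2]), "reads": int(fields[3]),
                       "strand": fields[5], "placements": {}}
            contigs[fields[1]] = current
        elif tag == "AF" and len(fields) == 4 and current is not None:
            # AF {read name} {C or U} {padded start consensus position}
            current["placements"][fields[1]] = (fields[2], int(fields[3]))
    return summary, contigs
```

Passing the lines of an ACE file (for example, an open text file object) returns the assembly totals and, per contig, the strand and padded start position of each read.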

READS file format

The READS file stores quality/flow information for the sequence reads used to assemble a genome. The information is organized into Sections, each having one or more Records. A Section in the READS file corresponds to a Contig in the associated ACE file, and a Record corresponds to a read in that Contig.

Section format

A Section consists of a section header and one or more Records. A Section starts with a section header in the following format:

>contig_name num_of_read num_line_per_record

The three fields are separated by tabs and are defined as follows:

contig_name - the name of contig
num_of_read - the total number of reads in the contig
num_line_per_record - the number of lines used to store read quality/flow information in a Record

Note: since the read ID occupies one line in a Record, the total number of lines in a Record is num_line_per_record+1.
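As a small illustration of the header layout and the line arithmetic in the note, the Python sketch below parses a section header and computes how many lines the Section's Records occupy. The function name parse_section_header is hypothetical, and the sketch assumes the header fields are tab-separated exactly as described.

```python
def parse_section_header(line):
    """Parse '>contig_name<TAB>num_of_read<TAB>num_line_per_record' (sketch)."""
    if not line.startswith(">"):
        raise ValueError("a section header must start with '>'")
    name, num_of_read, num_line = line[1:].rstrip("\r\n").split("\t")
    num_of_read, num_line = int(num_of_read), int(num_line)
    # Each Record has one read ID line plus num_line_per_record data lines.
    lines_per_record = num_line + 1
    total_record_lines = num_of_read * lines_per_record
    return name, num_of_read, num_line, total_record_lines
```

For a header with 6 reads and 3 lines per record, the Records of the Section span 6 * (3 + 1) = 24 lines.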

Record format

The first line of a Record always contains the read ID and Read_Type, and is followed by a line containing the base qualities of the read. Additional lines may follow for other types of information (e.g. flowgram signals of the read). Any information other than quality must be given as a pair of lines, with indexes on the first line and values on the second. The total number of lines in the Record should be 1+num_line_per_record, as defined in the Section header. Again, tabs separate the individual fields/values on each line. For example:

READID        Read_Type
Quality_Line
Flow_Index_Line
Flow_Value_Line

In the example, the last two lines, Flow_Index_Line and Flow_Value_Line, are optional.
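The Record layout can be read with a short Python sketch such as the one below. It is an illustration under stated assumptions, not EagleView's reader: parse_record and the returned dictionary keys are hypothetical, qualities and indexes are assumed to be integers, and signal values are read as floats because the Illumina values in READS files are decimal numbers.

```python
def parse_record(lines):
    """Parse one Record given as a list of tab-delimited lines (sketch)."""
    read_id, read_type = lines[0].rstrip("\r\n").split("\t")
    qualities = [int(q) for q in lines[1].split()]
    extra = lines[2:]
    if len(extra) % 2 != 0:
        raise ValueError("index and value lines must come in pairs")
    pairs = []
    for i in range(0, len(extra), 2):
        indexes = [int(x) for x in extra[i].split()]       # index line
        values = [float(x) for x in extra[i + 1].split()]  # value line
        pairs.append((indexes, values))
    return {"id": read_id, "type": int(read_type),
            "quality": qualities, "pairs": pairs}
```

A Record with only a read ID line and a quality line yields an empty list of pairs, which matches the optional status of the flow lines.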

Read Type

EagleView currently supports two read types: 454 reads, represented by flowgram signals, and Illumina/Solexa reads, represented by four-color signals. The values 1 and 2 indicate 454 reads (type 1 reads) and Illumina reads (type 2 reads), respectively. EagleView also supports mixed read types in one READS file, for co-assembly of sequencing reads from different technologies.

A sample READS file with both 454 and Illumina reads

>ctg1	2	3
DCYMEOY01A2GZ0	1
36	24	26	27	26	26	22	27	23	...
8	8	10	12	13	15	18	21	24	...
97	14	107	12	12	91	11	99	209	...
DCYMEOY01A2P6P	1
36	24	26	26	26	26	23	35	23	...
9	9	11	12	13	15	18	19	19	...
98	8	105	11	13	99	14	100	8	...
>ctg2	6	3
DCYMEOY01AFSP6	1
44	31	17	4	40	29	11	21	46	...
9	9	9	9	12	12	12	13	16	...
91	6	106	6	10	99	9	100	7	...
DCYMEOY01AGTAS	1
36	24	35	23	25	26	24	45	32	...
9	9	11	11	14	15	17	20	20	...
89	7	108	7	9	102	14	98	12	...
DCYMEOY01AH532	1
27	41	30	11	33	21	27	26	26	...
10	12	12	12	15	15	18	19	20	...
95	6	109	7	9	99	10	101	14	...
B_RUN1_7_168_884_609	2
30	30	30	30	30	30	30	30	30	...
0	6	11	14	18	20	27	28	34	...
7992.9	658.2	473.6	507.7	368.8	327.5	7055.2	1507.2	887.0 ...
B_RUN1_7_168_506_389	2
30	30	30	30	30	30	30	30	30	...
3	7	10	15	16	20	24	28	32	...
445.2	183.9	132.3	8521.9	769.3	1116.8	774.4	14748.8	663.1 ...
B_RUN1_1_135_384_892	2
30	30	30	30	30	30	30	30	30	...
2	7	11	12	18	23	26	31	35	...
537.0	282.5	9355.4	898.1	355.2	3014.5	409.3	5507.6	646.1 ...

EGL file format

An EGL file stores the file locations of the Section headers in the corresponding READS file. Each line in an EGL file has the following format:

contig_name location_section num_reads num_line

Here location_section is the file location in the READS file that points to the end of the section header starting with contig_name; num_reads is the total number of reads in the section; and num_line, equivalent to num_line_per_record defined above, is the number of lines storing quality/flow information per record.

A sample EGL file

ctg1	9	247	3
ctg2	301765	107	3
ctg3	439634	360	3
ctg4	904028	135	3
ctg5	107267	225	3
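The purpose of such an index is to let a reader jump straight to one Section of a large READS file. The Python sketch below shows one way to build and use the four EGL fields. It is not EagleView's implementation: the function names are hypothetical, and it assumes that location_section is the byte offset immediately after the newline that ends the section header, which is one reading of "the end of section header".

```python
import io

def build_egl_entries(reads_stream):
    """Scan a binary READS stream and return EGL fields per Section (sketch)."""
    entries = []
    while True:
        line = reads_stream.readline()
        if not line:
            break
        if line.startswith(b">"):
            name, num_reads, num_line = line[1:].decode("ascii").split()
            # Assumption: the offset just past the header line is recorded.
            entries.append((name, reads_stream.tell(), int(num_reads), int(num_line)))
    return entries

def read_section(reads_stream, entry):
    """Seek to a Section and return the lines of all its Records."""
    name, offset, num_reads, num_line = entry
    reads_stream.seek(offset)
    total = num_reads * (num_line + 1)   # one ID line plus data lines per Record
    return [reads_stream.readline().decode("ascii").rstrip("\r\n")
            for _ in range(total)]

# A tiny in-memory READS file with two one-read Sections.
demo = io.BytesIO(b">ctg1\t1\t1\nr1\t1\n36\t24\n>ctg2\t1\t1\nr2\t2\n30\t30\n")
index = build_egl_entries(demo)
```

Each entry can be written out as one tab-separated EGL line, and read_section(demo, index[1]) then reads the Records of the second Section without scanning the first.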

MAP file format

A MAP file usually contains two types of information: feature classes and feature locations. Feature classes instruct EagleView how to present different classes of features, and feature locations tell where these features are in a genome. Each feature class or feature location is defined by a single line.

Feature Class Format

A feature class line always starts with '>', immediately followed by a class name and then the mapping instructions, separated by tabs. Each instruction consists of a key and value pair separated only by '='. Four keys are currently defined, each represented by two upper-case letters. Each feature class should have at least one instruction and no more than four. The four keys are:

BG - background color
FG - foreground color
SY - class name or class symbol, given as one word or a single letter/symbol. The class name/symbol is displayed in the viewer.
PT - position type of the feature locations, either 0 or 1. Type 0 stands for unpadded positions and type 1 for padded positions, which count gaps/pads. The default value is 0.

Note: a color is defined in RGB format in the form #rrggbb, where each digit is a hexadecimal digit between '0' and 'F'. EagleView also recognizes names for some commonly used colors, such as white, blue, black, cyan, green, navy, red, and yellow.

For example:

>class_name	BG=#FF00FF	FG=blue	SY=anyName
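The feature class rules can be checked with a short Python sketch. The function name parse_feature_class is hypothetical and this is not EagleView's parser. It assumes tab-separated fields, as in MAP files, and it validates only colors written in the #rrggbb form, because the list of recognized color names given above is not exhaustive.

```python
import re

VALID_KEYS = ("BG", "FG", "SY", "PT")
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")

def parse_feature_class(line):
    """Parse '>class_name<TAB>KEY=value...' into (name, options) (sketch)."""
    if not line.startswith(">"):
        raise ValueError("a feature class line must start with '>'")
    name, *instructions = line[1:].rstrip("\r\n").split("\t")
    if not 1 <= len(instructions) <= 4:
        raise ValueError("expected one to four instructions")
    options = {"PT": 0}                       # documented default for PT
    for item in instructions:
        key, sep, value = item.partition("=")
        if sep != "=" or key not in VALID_KEYS:
            raise ValueError(f"bad instruction: {item!r}")
        if key in ("BG", "FG") and value.startswith("#"):
            if not HEX_COLOR.fullmatch(value):
                raise ValueError(f"bad color: {value!r}")
        options[key] = int(value) if key == "PT" else value
    if options["PT"] not in (0, 1):
        raise ValueError("PT must be 0 or 1")
    return name, options
```

A line whose fields are separated by spaces rather than tabs is rejected by this sketch, since it then contains no instructions after splitting on tabs.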

Feature Location Format

A feature location consists of at least a feature class name and a start position. If the feature is longer than 1 base, the stop position should also be provided. Fields are separated by tabs. The three valid feature entry formats are:

class_name	begin_pos	end_pos	feature_name
class_name	begin_pos	end_pos
class_name	position

A sample MAP file

>SNP	BG=white	FG=red	SY=+
>E	BG=cyan	SY=exon
>I	BG=yellow	SY=intron
>G	BG=gray	SY=gene	PT=0
>R	FG=#00FF00	SY=$	PT=0
SNP	3
SNP	76
E	15	25	the exon name
I	100	125
G	72	141	the gene name
R	23	47
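The three feature location formats can be told apart by their field count. The Python sketch below illustrates this; the function names are hypothetical, and treating a single-position feature as one whose begin and end positions are equal is an assumption made for convenience.

```python
def parse_feature_location(line):
    """Parse one tab-delimited feature location line (sketch).

    Returns (class_name, begin_pos, end_pos, feature_name), with
    feature_name set to None when it is not given.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 2:
        class_name, position = fields
        # Assumption: a single-base feature has begin_pos == end_pos.
        return class_name, int(position), int(position), None
    if len(fields) == 3:
        class_name, begin, end = fields
        return class_name, int(begin), int(end), None
    if len(fields) == 4:
        class_name, begin, end, feature_name = fields
        return class_name, int(begin), int(end), feature_name
    raise ValueError(f"unexpected number of fields: {len(fields)}")

def read_locations(lines):
    """Return the feature locations in a MAP file, skipping feature class lines."""
    return [parse_feature_location(line) for line in lines
            if line.strip() and not line.startswith(">")]
```

Applied to the lines of a MAP file, read_locations returns one tuple per feature location and leaves the lines starting with '>' to a separate feature class parser.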