
4. Marker data format

A marker input data file normally consists of a single header line, followed by multiple marker data records, with one line of marker data per person. The ASPEX programs handle several file format variants by automatically determining the format of the input data. There are two basic data formats: the ``basic'' or ``Risch'' format, with pedigree information implied by line ordering, and pre-makeped LINKAGE format, with explicit pedigree information. The Risch format is supported mainly for historical reasons, and we discourage its use because the extra information in a LINKAGE file can help catch some kinds of errors.

The header line consists of a space-separated list of marker names for which this file contains data.

Each record of the basic Risch marker data format looks like:

[fid] [pid] [1a] [1b] [2a] [2b] [3a] [3b] ...

or if a disease status field is available:

[fid] [pid] [sick] [1a] [1b] [2a] [2b] [3a] [3b] ...

where [fid] is a unique identifier for this family, [pid] is an identifier for this person, [sick] indicates if the person is sick or healthy, and the rest of the fields are pairs of allele identifiers. Family, person, and allele identifiers can be arbitrary numbers or strings of up to 15 characters. The family and person identifiers may be separated by a period. Other fields must be separated by blank space (either spaces or tabs).

For [sick], values of `Y', `y', `T', `t', and 2 indicate that the person is affected, and `N', `n', `F', `f', and 1 indicate that the person is unaffected. If the value matches `?', `U', `u', or the blank allele, the person is considered to have unknown disease status. These mappings can be changed using the sick_set, well_set, and unk_set parameters. Siblings with unknown disease status will never be included in a sib pair, but will be used for reconstructing missing parents.
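The default mapping above can be sketched as a small lookup. This is only an illustration, not ASPEX code: the function name classify_status is hypothetical, the blank allele is assumed to be coded as 0 (as in the sample files in this section), and raising an error for an unrecognized code is an assumption, since the text does not say how such values are handled.

```python
# Default code sets taken from the text above. In ASPEX these can be
# overridden with the sick_set, well_set, and unk_set parameters.
SICK_SET = {"Y", "y", "T", "t", "2"}
WELL_SET = {"N", "n", "F", "f", "1"}
UNK_SET = {"?", "U", "u"}
BLANK_ALLELE = "0"  # assumption: blank allele coded as 0, as in the samples


def classify_status(code, blank=BLANK_ALLELE):
    """Map a [sick] field value to 'affected', 'unaffected', or 'unknown'."""
    if code in SICK_SET:
        return "affected"
    if code in WELL_SET:
        return "unaffected"
    if code in UNK_SET or code == blank:
        return "unknown"
    # Assumption: reject anything else rather than guess its meaning.
    raise ValueError(f"unrecognized disease status code: {code!r}")


for code in ["Y", "1", "?", "0"]:
    print(code, classify_status(code))
```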

Within a group of records for a complete family, the first record should describe the father, followed by the mother, followed by the children. Blank lines are ignored, and spacing within a line is not important.

Here is a sample marker data file for a single family, with four markers and a column indicating disease status:

mark1 mark2 mark3 mark4
001 1  0  1 1  1 2  1 3  2 2
001 2  0  0 0  3 4  2 3  2 4
001 3  2  1 2  1 4  2 3  2 2
001 4  2  1 3  1 3  3 3  2 4
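As a rough illustration of the record layout, the following sketch reads a Risch-format file like the sample above. It is not how the ASPEX programs themselves parse input: the function name parse_risch and the has_status flag are hypothetical, and the caller is assumed to know whether a disease status column is present. A joined "fid.pid" identifier is recognized by the field count, which is one way to handle the optional period separator.

```python
SAMPLE = """\
mark1 mark2 mark3 mark4
001 1  0  1 1  1 2  1 3  2 2
001 2  0  0 0  3 4  2 3  2 4
001 3  2  1 2  1 4  2 3  2 2
001 4  2  1 3  1 3  3 3  2 4
"""


def parse_risch(text, has_status=True):
    """Return (marker_names, records) for a Risch-format marker file."""
    # Blank lines are ignored, and spacing within a line is not important.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    markers = lines[0].split()
    n_data = 2 * len(markers) + (1 if has_status else 0)
    records = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) == n_data + 2:
            # Family and person identifiers given as separate fields.
            fid, pid, rest = fields[0], fields[1], fields[2:]
        elif len(fields) == n_data + 1:
            # Family and person identifiers joined by a period: "001.3".
            fid, sep, pid = fields[0].partition(".")
            if not sep:
                raise ValueError(f"missing person identifier: {ln!r}")
            rest = fields[1:]
        else:
            raise ValueError(f"unexpected number of fields: {ln!r}")
        status = rest.pop(0) if has_status else None
        genotypes = {m: (rest[2 * i], rest[2 * i + 1])
                     for i, m in enumerate(markers)}
        records.append({"fid": fid, "pid": pid,
                        "status": status, "genotypes": genotypes})
    return markers, records


markers, records = parse_risch(SAMPLE)
for rec in records:
    print(rec["fid"], rec["pid"], rec["status"], rec["genotypes"]["mark2"])
```

Because pedigree information in this format is implied by line ordering, the order of the returned records matters: within a family, the first is the father, the second the mother, and the rest are children.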

For sex-linked data, all alleles on a ``Y'' chromosome should be coded as ``Y'' in the marker data. Here is a sample sex-linked marker data file:

mark1 mark2 mark3 mark4
001 1  0  1 Y  1 Y  3 Y  2 Y
001 2  0  0 0  3 4  2 3  2 4
001 3  2  1 Y  4 Y  2 Y  2 Y
001 4  2  1 3  1 3  3 3  2 4

To simplify the use of marker data files prepared for other programs, the ASPEX programs can also read LINKAGE marker data files, with or without the header line describing which markers are present. In this case, the marker data should have the form:

[fid] [pid] [dad] [mom] [sex] [1a] [1b] [2a] [2b] ...

or if a disease status field is present:

[fid] [pid] [dad] [mom] [sex] [sick] [1a] [1b] [2a] [2b] ...

In this format, three new columns are used to explicitly identify the parents and gender of each individual. The IDs of parents who are not present should be the same as the blank allele specifier. The [sex] field should be either `M', `m', or `1' for males; or `F', `f', or `2' for females. Family structures other than simple nuclear families will generate error messages.

If a LINKAGE format file has a header line listing marker names, the ASPEX programs will use it. However, the header line is optional if a single LINKAGE format file contains data for all nloc markers in the order they are specified in the loc parameter.

Here is a sample marker data file in LINKAGE format:

mark1 mark2 mark3 mark4
001 1  0 0 m  0  1 1  1 2  1 3  2 2
001 2  0 0 f  0  0 0  3 4  2 3  2 4
001 3  1 2 m  2  1 2  1 4  2 3  2 2
001 4  1 2 m  2  1 3  1 3  3 3  2 4
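A record in this layout could be decoded along the following lines. This is a minimal sketch under stated assumptions, not the ASPEX implementation: the function name parse_linkage_record and its parameters are hypothetical, the blank allele specifier is assumed to be 0 as in the sample, and the marker names are assumed to come from the header line.

```python
# Accepted [sex] codes, as listed in the text.
SEX_CODES = {"M": "male", "m": "male", "1": "male",
             "F": "female", "f": "female", "2": "female"}
BLANK = "0"  # assumption: blank allele specifier is 0, as in the sample


def parse_linkage_record(line, markers, has_status=True, blank=BLANK):
    """Decode one pre-makeped LINKAGE record into a dictionary."""
    fields = line.split()
    fid, pid, dad, mom, sex = fields[:5]
    rest = fields[5:]
    status = rest.pop(0) if has_status else None
    if len(rest) != 2 * len(markers):
        raise ValueError(f"expected {2 * len(markers)} alleles: {line!r}")
    if sex not in SEX_CODES:
        raise ValueError(f"unrecognized sex code {sex!r}: {line!r}")
    return {
        "fid": fid,
        "pid": pid,
        # A parent ID equal to the blank allele means the parent is absent.
        "dad": None if dad == blank else dad,
        "mom": None if mom == blank else mom,
        "sex": SEX_CODES[sex],
        "status": status,
        "genotypes": {m: (rest[2 * i], rest[2 * i + 1])
                      for i, m in enumerate(markers)},
    }


markers = "mark1 mark2 mark3 mark4".split()
child = parse_linkage_record("001 3  1 2 m  2  1 2  1 4  2 3  2 2", markers)
print(child["dad"], child["mom"], child["sex"], child["genotypes"]["mark3"])
```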

For sex-linked data in LINKAGE format, males can be listed either as homozygotes at all loci, or with one allele coded as `Y' at each locus. Here is a sample sex-linked data file, with males coded as homozygotes:

mark1 mark2 mark3 mark4
001 1  0 0 m  0  1 1  1 1  3 3  2 2
001 2  0 0 f  0  0 0  3 4  2 3  2 4
001 3  1 2 m  2  1 1  4 4  2 2  2 2
001 4  1 2 f  2  1 3  1 3  3 3  2 4
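Since males may be coded either way, a tool that reads such files may want to reduce both codings to the single X-linked allele. The sketch below shows one way to do that. The function name male_x_allele is hypothetical, the blank allele is assumed to be 0, and treating a heterozygous male genotype as an error is an assumption rather than documented ASPEX behavior.

```python
def male_x_allele(a, b, blank="0"):
    """Return a male's single X allele, or None if the locus is untyped.

    Accepts either coding described above: one allele plus `Y', or a
    homozygous pair.
    """
    alleles = [x for x in (a, b) if x != "Y"]
    if len(alleles) == 1:
        # Y coding: exactly one real allele alongside the Y.
        x = alleles[0]
    elif len(alleles) == 2 and a == b:
        # Homozygote coding: both fields carry the same allele.
        x = a
    else:
        # Assumption: anything else is inconsistent for a male.
        raise ValueError(f"inconsistent male genotype: {a} {b}")
    return None if x == blank else x


print(male_x_allele("1", "Y"))
print(male_x_allele("4", "4"))
print(male_x_allele("0", "0"))
```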

Finally, as a special case, a data file may contain only the pedigree, gender, and affected-status columns, and no genotype data. This is useful if several definitions of affected status are used, because the status information can be kept separate from the genotyping data. In this case, the header line should contain just the single word ``status'', like:

status
001 1  0 0 m  0
001 2  0 0 f  0
001 3  1 2 m  2
001 4  1 2 f  2
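To illustrate how a separate status file might be consumed, the following sketch reads a file like the one above into a lookup table keyed by family and person identifier. The function name read_status_file is hypothetical, and this is not the ASPEX reader itself. It assumes every record has exactly the six columns shown.

```python
STATUS_SAMPLE = """\
status
001 1  0 0 m  0
001 2  0 0 f  0
001 3  1 2 m  2
001 4  1 2 f  2
"""


def read_status_file(text):
    """Return {(fid, pid): status_code} from a status-only data file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The header must be the single word "status".
    if not lines or lines[0].split() != ["status"]:
        raise ValueError("header line must be the single word 'status'")
    table = {}
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields: {ln!r}")
        fid, pid, _dad, _mom, _sex, sick = fields
        table[(fid, pid)] = sick
    return table


status = read_status_file(STATUS_SAMPLE)
print(status[("001", "3")])
```

Keeping one such file per definition of affected status means the genotype file never has to be edited when the definition changes.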

