A marker input data file normally consists of a single header line, followed by multiple marker data records, with one line of marker data per person. The ASPEX programs handle several file format variants by automatically determining the format of the input data. There are two basic data formats: the ``basic'' or ``Risch'' format, with pedigree information implied by line ordering, and pre-makeped LINKAGE format, with explicit pedigree information. The Risch format is supported mainly for historical reasons, and we discourage its use because the extra information in a LINKAGE file can help catch some kinds of errors.
The header line consists of a space-separated list of marker names for which this file contains data.
Each record of the basic Risch marker data format looks like:
[fid] [pid] [1a] [1b] [2a] [2b] [3a] [3b] ...
or if a disease status field is available:
[fid] [pid] [sick] [1a] [1b] [2a] [2b] [3a] [3b] ...
where [fid] is a unique identifier for this family, [pid] is an identifier for this person, [sick] indicates whether the person is sick or healthy, and the remaining fields are pairs of allele identifiers. Family, person, and allele identifiers can be arbitrary numbers or strings of up to 15 characters. The family and person identifiers may be separated by a period. Other fields must be separated by blank space (either spaces or tabs).
For [sick], values of `Y', `y', `T', `t', and 2 indicate that the person is affected, and `N', `n', `F', `f', and 1 indicate unaffected. If the value matches `?', `U', `u', or the blank allele, the person will be considered to have unknown disease status. These mappings can be changed using the sick_set, well_set, and unk_set parameters. Siblings with unknown disease status will never be included in a sib pair, but will be used for reconstructing missing parents.
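The default status mapping described above can be sketched in Python as follows. This is only an illustration, not the ASPEX implementation: the function name classify_status is hypothetical, the blank allele is assumed to be `0' (as in the sample files in this section), and raising an error on an unrecognized code is a choice made for this sketch.

```python
# Default code sets as listed in the text. In ASPEX they can be overridden
# with the sick_set, well_set, and unk_set parameters.
BLANK_ALLELE = "0"  # assumption: the blank allele specifier is "0"
SICK_SET = {"Y", "y", "T", "t", "2"}
WELL_SET = {"N", "n", "F", "f", "1"}
UNK_SET = {"?", "U", "u", BLANK_ALLELE}


def classify_status(value, sick_set=SICK_SET, well_set=WELL_SET, unk_set=UNK_SET):
    """Map a raw [sick] field to 'affected', 'unaffected', or 'unknown'."""
    if value in sick_set:
        return "affected"
    if value in well_set:
        return "unaffected"
    if value in unk_set:
        return "unknown"
    # How ASPEX treats other values is not stated; this sketch rejects them.
    raise ValueError(f"unrecognized disease status code: {value!r}")


statuses = [classify_status(v) for v in ["2", "n", "0"]]
```

Passing different sets to the keyword arguments mimics the effect of changing the three parameters.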
Within a group of records for a complete family, the first record should describe the father, followed by the mother, followed by the children. Blank lines are ignored, and spacing within a line is not important.
Here is a sample marker data file for a single family, with four marker positions, with a column indicating disease status:
mark1 mark2 mark3 mark4
001 1 0 1 1 1 2 1 3 2 2
001 2 0 0 0 3 4 2 3 2 4
001 3 2 1 2 1 4 2 3 2 2
001 4 2 1 3 1 3 3 3 2 4
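To make the record layout concrete, here is a minimal Python sketch that reads a file like the sample above. It is not the ASPEX parser: the function name parse_risch and the returned dictionary layout are hypothetical, and the caller must say whether a disease status column is present, whereas ASPEX determines the format automatically.

```python
SAMPLE = """\
mark1 mark2 mark3 mark4
001 1 0 1 1 1 2 1 3 2 2
001 2 0 0 0 3 4 2 3 2 4
001 3 2 1 2 1 4 2 3 2 2
001 4 2 1 3 1 3 3 3 2 4
"""


def parse_risch(text, has_status=True):
    """Parse basic-format marker data into a list of per-person dicts."""
    # Blank lines are ignored; any run of spaces or tabs separates fields.
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    markers, records = rows[0], rows[1:]
    n_tail = (1 if has_status else 0) + 2 * len(markers)
    seen = {}  # number of records already read for each family
    people = []
    for fields in records:
        if len(fields) == n_tail + 2:
            fid, pid = fields[0], fields[1]
        elif len(fields) == n_tail + 1 and "." in fields[0]:
            # Family and person identifiers joined by a period, e.g. "001.3".
            fid, pid = fields[0].split(".", 1)
        else:
            raise ValueError(f"unexpected number of fields: {fields}")
        tail = fields[len(fields) - n_tail:]
        status = tail[0] if has_status else None
        alleles = tail[1:] if has_status else tail
        # Line order implies the pedigree: father, mother, then children.
        k = seen.get(fid, 0)
        seen[fid] = k + 1
        role = ("father", "mother")[k] if k < 2 else "child"
        genotypes = dict(zip(markers, zip(alleles[0::2], alleles[1::2])))
        people.append({"fid": fid, "pid": pid, "role": role,
                       "status": status, "genotypes": genotypes})
    return people


people = parse_risch(SAMPLE)
```

For the sample, the third record is a child whose genotype at mark2 is the pair (`1', `4').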
For sex-linked data, all alleles on a ``Y'' chromosome should be coded as ``Y'' in the marker data. Here is a sample sex-linked marker data file:
mark1 mark2 mark3 mark4
001 1 0 1 Y 1 Y 3 Y 2 Y
001 2 0 0 0 3 4 2 3 2 4
001 3 2 1 Y 4 Y 2 Y 2 Y
001 4 2 1 3 1 3 3 3 2 4
To simplify the use of marker data files prepared for other programs, the ASPEX programs can also read LINKAGE marker data files, with or without the header line describing which markers are present. In this case, the marker data should have the form:
[fid] [pid] [dad] [mom] [sex] [1a] [1b] [2a] [2b] ...
or if a disease status field is present:
[fid] [pid] [dad] [mom] [sex] [sick] [1a] [1b] [2a] [2b] ...
In this format, three new columns are used to explicitly identify the parents and gender of each individual. The IDs of parents who are not present should be the same as the blank allele specifier. The [sex] field should be either `M', `m', or `1' for males; or `F', `f', or `2' for females. Family structures other than simple nuclear families will generate error messages.
If a LINKAGE format file has a header line listing marker names, the ASPEX programs will use it. However, the header line is optional if a single LINKAGE format file contains data for all nloc markers in the order they are specified in the loc parameter.
Here is a sample marker data file in LINKAGE format:
mark1 mark2 mark3 mark4
001 1 0 0 m 0 1 1 1 2 1 3 2 2
001 2 0 0 f 0 0 0 3 4 2 3 2 4
001 3 1 2 m 2 1 2 1 4 2 3 2 2
001 4 1 2 m 2 1 3 1 3 3 3 2 4
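The LINKAGE record layout, with the disease status field present, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the ASPEX code: the function name parse_linkage_record is hypothetical, the blank allele specifier is assumed to be `0', and the number of markers is passed in by the caller.

```python
MALE_CODES = {"M", "m", "1"}
FEMALE_CODES = {"F", "f", "2"}
BLANK = "0"  # assumption: blank allele specifier, also used for absent parents


def parse_linkage_record(line, n_markers, has_status=True):
    """Split one pre-makeped LINKAGE record into named fields."""
    fields = line.split()
    n_fixed = 6 if has_status else 5  # fid pid dad mom sex [sick]
    expected = n_fixed + 2 * n_markers
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    fid, pid, dad, mom, sex = fields[:5]
    if sex in MALE_CODES:
        sex = "male"
    elif sex in FEMALE_CODES:
        sex = "female"
    else:
        raise ValueError(f"unrecognized sex code: {sex!r}")
    alleles = fields[n_fixed:]
    return {
        "fid": fid,
        "pid": pid,
        # A parent ID equal to the blank specifier means "not present".
        "dad": None if dad == BLANK else dad,
        "mom": None if mom == BLANK else mom,
        "sex": sex,
        "status": fields[5] if has_status else None,
        "genotypes": list(zip(alleles[0::2], alleles[1::2])),
    }


father = parse_linkage_record("001 1 0 0 m 0 1 1 1 2 1 3 2 2", n_markers=4)
child = parse_linkage_record("001 3 1 2 m 2 1 2 1 4 2 3 2 2", n_markers=4)
```

Because the parents are named explicitly, a reader of this format can check each child against the listed father and mother instead of relying on line order.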
For sex-linked data in LINKAGE format, males can be listed either as homozygotes at all loci, or with one allele coded as `Y' at each locus. Here is a sample sex-linked data file, with males coded as homozygotes:
mark1 mark2 mark3 mark4
001 1 0 0 m 0 1 1 1 1 3 3 2 2
001 2 0 0 f 0 0 0 3 4 2 3 2 4
001 3 1 2 m 2 1 1 4 4 2 2 2 2
001 4 1 2 f 2 1 3 1 3 3 3 2 4
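The two male codings are related by a simple rewrite, sketched below in Python. The helper name to_y_coding is hypothetical, and two choices here are assumptions of this sketch rather than documented ASPEX behavior: a fully blank genotype (assumed to be `0 0') is left unchanged, and `Y' is written as the second allele, as in the basic-format example earlier in this section.

```python
def to_y_coding(pair, blank="0"):
    """Rewrite a male's homozygous genotype (a, a) as (a, 'Y')."""
    a, b = pair
    if "Y" in pair:
        return pair  # already uses the explicit Y coding
    if a == blank and b == blank:
        return pair  # assumption: missing genotypes are left untouched
    if a != b:
        # A male cannot carry two different alleles at a sex-linked locus.
        raise ValueError(f"male genotype is heterozygous: {pair}")
    return (a, "Y")


# Genotypes of person 3 from the sample above (male, coded as homozygotes).
son = [("1", "1"), ("4", "4"), ("2", "2"), ("2", "2")]
son_y = [to_y_coding(g) for g in son]
```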
Finally, as a special case, a data file may contain only the pedigree, gender, and affected-status columns, and no genotype data. This is useful if several definitions of affected status are used, because the status information can be kept separate from the genotyping data. In this case, the header line should contain just the single word ``status'', like:
status
001 1 0 0 m 0
001 2 0 0 f 0
001 3 1 2 m 2
001 4 1 2 f 2
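As a rough illustration of how a status-only file might be consumed, the following Python sketch builds a lookup table from (family, person) to the raw status code. The function name read_status_file is hypothetical, and the sketch assumes every record has exactly the six columns shown above.

```python
STATUS_SAMPLE = """\
status
001 1 0 0 m 0
001 2 0 0 f 0
001 3 1 2 m 2
001 4 1 2 f 2
"""


def read_status_file(text):
    """Return {(fid, pid): raw status code} from a status-only file."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not rows or rows[0] != ["status"]:
        raise ValueError("header line must be the single word 'status'")
    table = {}
    for fields in rows[1:]:
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {fields}")
        fid, pid, _dad, _mom, _sex, sick = fields
        table[(fid, pid)] = sick
    return table


status_by_person = read_status_file(STATUS_SAMPLE)
```

Keeping one such file per definition of affected status lets the same genotype file be reused with each of them.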