*                                                                             *
*  Program: Refcomp                                                           *
*  Version: 1.0                                                               *
*  Copyright (C) 1996-1997                                                    *
*  by Deborah A. Nickerson,  Natali Kolker, Scott Taylor,  and Mark Rieder    *
*  University of Washington                                                   *
*                                                                             *
*  All rights reserved.                                                       *
*                                                                             *
*  This software may not be redistributed, distributed                        *
*  in modified form, or used for any commercial purpose, including            *
*  commercially funded sequencing, without written permission from            *
*  the authors and the University of Washington.                              *
*                                                                             *
*  This software is provided ``AS IS'' and any express or implied             *
*  warranties, including, but not limited to, the implied warranties of       *
*  merchantability and fitness for a particular purpose are disclaimed.       *
*  In particular, this disclaimer applies to any diagnostic purpose. In no    *
*  event shall the authors or the University of Washington be liable for      *
*  any direct, indirect, incidental, special, exemplary, or consequential     *
*  damages (including, but not limited to, procurement of substitute goods    *
*  or services; loss of use, data, or profits; or business interruption)      *
*  however caused and on any theory of liability, whether in contract,        *
*  strict liability, or tort (including negligence or otherwise) arising      *
*  in any way out of the use of this software, even if advised of the         *
*  possibility of such damage.                                                *
*                                                                             *


Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA
sequence variation in the human genome.  The identification and typing of
these variations play a central role in analyzing the relationships between
genome structure and function, and in understanding the allelic variation
within and among populations.  The typing of SNPs also plays a role in
identifying mutated oncogenes and genetic and infectious diseases, in
matching tissues prior to transplantation, and in forensic and paternity
testing.

Many techniques are used to identify sequence variants among different
individuals using DNA amplified by the polymerase chain reaction (PCR).  These
include denaturing gel electrophoresis, chemical or enzymatic cleavage, 
heteroduplex analysis, the analysis of single-stranded DNA conformations, and
direct sequencing of the PCR product.  Of these methods, direct DNA sequencing
is the most accurate and the most automated approach for scanning DNA fragments
for variation.  Furthermore, it is the only method that provides complete
information about the location and nature of the sequence variation using a
single set of reagents and assay conditions.  Despite the advantages of 
sequencing PCR products to identify DNA variations, there is one drawback:
it is difficult to accurately identify heterozygous sites within a sequence.
By comparing sequence traces containing variant sites for homozygotes and
heterozygotes we have noted two consistent changes: 1) a significant drop in
fluorescence peak height at a variant site when sequence traces obtained from
homozygous individuals are compared to traces from heterozygous individuals,
and 2) the presence of a second fluorescence peak in sequence traces from
heterozygous individuals (1, 2).  We have developed a program known as 
PolyPhred to scan for these two features when sequence traces are being 
compared (2). 

Refcomp (6) was designed to analyze sequencing traces that contain data from
strictly homozygous samples (e.g., cloned DNA or mitochondrial DNA).  Such
data represent a special case that can be analyzed for mismatches with a known
reference sequence.  Refcomp will determine the high-quality positions within
an assembled DNA contig and produce a report listing these sites.


Refcomp is designed as a member of an integrated suite of sequence analysis
applications that includes Phred (3), Phrap (4), and Consed (5), and is not a
stand-alone program.  Phred provides the base calls, the base-call quality
information, and the peak size information.  Phrap is used to assemble the
input sequences into one or more contigs.
Refcomp will parse contig alignment data files (.ace files), produced by the
program Phrap, and identify sites in the consensus sequence which differ from
a defined reference sequence.  Refcomp will generate data tags for the
mismatched sites and the Consed program is used to visually review the Refcomp
output.  The Refcomp output is also written to a file in a format that can be
easily parsed into a database program.  
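
The comparison described above can be illustrated with a minimal sketch.
This is not Refcomp's actual code: it assumes the reference and consensus
sequences are already aligned, of equal length, and free of gaps, and the
names find_mismatches and Mismatch are hypothetical.

```python
from typing import List, NamedTuple


class Mismatch(NamedTuple):
    position: int   # 1-based position in the aligned sequences
    ref_base: str   # base found in the reference sequence
    cons_base: str  # base found in the consensus sequence
    quality: int    # quality value of the consensus base


def find_mismatches(reference: str, consensus: str,
                    qualities: List[int],
                    min_quality: int = 20) -> List[Mismatch]:
    """Report positions where the consensus differs from the reference.

    Assumption: both sequences are pre-aligned, ungapped, and of equal
    length, with one consensus quality value per position.
    """
    if not (len(reference) == len(consensus) == len(qualities)):
        raise ValueError("inputs must all have the same length")
    sites = []
    pairs = zip(reference.upper(), consensus.upper(), qualities)
    for index, (ref_base, cons_base, quality) in enumerate(pairs):
        # Only bases at or above the quality threshold are considered.
        if quality >= min_quality and ref_base != cons_base:
            sites.append(Mismatch(index + 1, ref_base, cons_base, quality))
    return sites
```

A real assembly contains padded positions and separate reference and
consensus coordinates, which this sketch deliberately ignores.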


1)      -ace [name of ace file]

        The user must supply the name of the ace file containing the
        assembly information for the assembly to be scanned. Required.

2)      -d [directory name]

        The user can supply the name of the project directory, i.e. the
        directory that contains 'edit_dir'.  Optional.  Default "../".

3)      -quality [number between 0 and 50]

        The Phred quality value used to determine search limits and
        to limit the base mismatches reported in the output. Optional.
        Default 20.
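
As an illustration of how the three options combine, the following sketch
builds a Refcomp argument list and checks the documented constraints.  The
helper name build_refcomp_args is hypothetical and is not part of Refcomp;
only the flags -ace, -d, and -quality come from the list above.

```python
from typing import List, Optional


def build_refcomp_args(ace_file: str,
                       directory: Optional[str] = None,
                       quality: Optional[int] = None) -> List[str]:
    """Build an argument list for refcomp (hypothetical helper)."""
    if not ace_file:
        raise ValueError("-ace is required")
    args = ["refcomp"]
    if directory is not None:
        # Optional; Refcomp itself defaults to "../" when -d is omitted.
        args += ["-d", directory]
    args += ["-ace", ace_file]
    if quality is not None:
        # The documented range for -quality is 0 to 50 (default 20).
        if not 0 <= quality <= 50:
            raise ValueError("-quality must be between 0 and 50")
        args += ["-quality", str(quality)]
    return args
```

Omitted options are left off the command line so that Refcomp applies its
own defaults.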


Refcomp writes output both to the standard output port and the standard error
port.   The standard output contains information about the analysis such as 
the location of a putative polymorphism and the putative genotypes for each 
sample.  The standard error contains messages about Refcomp's progress.
Generally the standard output is redirected to a file and the standard
error is allowed to be printed on the screen.  In both outputs, information
is printed between beginning and ending tokens to make parsing of the file
easier.

Standard OUTPUT


1)      Prints out the command line used for the
        run.  If optional parameters were not supplied on
        the command line then the defaults are printed.  The
        associated token is 'COMMAND_LINE'.
2)      For each contig, prints the contig name. The associated
        token is 'CONTIG'.
3)      For each mismatched site, Refcomp prints the reference sequence
        position (R. pos.), the consensus sequence position (C. pos.), the
        base found in the reference sequence (R. base), the base found in
        the consensus sequence (C. base), the Phrap quality of the consensus
        sequence (C. Quality), a flag for opposite strand confirmation
        (1 = confirmation, 0 = no confirmation), and a flag designating
        whether a forward or reverse read was found at that position
        (1 = read present, 0 = no read present).  The associated token is
        'SITES'.

4)      For each mismatched position, Refcomp prints the name of the sequence
        chromatogram covering that site.  The information is given in the
        following order:  reference sequence position (R. pos), consensus
        sequence position (C. pos), chromatogram name, base found in the
        read (Base), Phred quality of the read base at this position
        (Q. phred), and Phrap quality at this position in the consensus
        sequence.  The associated token is 'READS'.

5)      Finally, Refcomp reports the total number of sites found which
	differ from the reference sequence.  The associated token is

Standard ERROR

1)      First prints out header information which includes
        the version number, a unique integer for identifying
        the Refcomp output (the same integer is used to
        uniquely identify the standard error), and the time
        of the execution.  The token used to separate these
        lines from the remainder of the output is 'HEADER'.

2)      Next prints out the command line used for the
        run.  If optional parameters were not supplied on
        the command line then the defaults are printed.  The
        associated token is 'COMMAND_LINE'.

3)      Then, for each contig, prints various messages which
        report on the progress of the run.  The associated 
        token is 'MESSAGE'.
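
Because both outputs are organized into token-delimited sections, a small
parser can split them apart.  The sketch below rests on an assumption that
this document does not confirm: that each section starts with a line of the
form BEGIN_<TOKEN> and ends with a line END_<TOKEN> (for example
BEGIN_SITES and END_SITES).  Check a real Refcomp output file and adjust the
prefixes before relying on it.  The function name split_sections is
hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# ASSUMPTION: the delimiter spelling is not given in this document.
BEGIN_PREFIX = "BEGIN_"
END_PREFIX = "END_"


def split_sections(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Group lines by token; a token may occur several times (e.g. CONTIG)."""
    sections = defaultdict(list)
    current_token = None
    current_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if current_token is None:
            # Outside a section: look for an opening delimiter.
            if stripped.startswith(BEGIN_PREFIX):
                current_token = stripped[len(BEGIN_PREFIX):]
                current_lines = []
        elif stripped == END_PREFIX + current_token:
            # Closing delimiter: store the collected block.
            sections[current_token].append(current_lines)
            current_token = None
        else:
            current_lines.append(line)
    return dict(sections)
```

Each token maps to a list of blocks, so per-contig sections stay separate
and can be loaded into a database one block at a time.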


These instructions will allow you to take chromatograms, analyze them with Phred and Phrap 
and run Refcomp on the .ace file.  

1)      Install Refcomp (see README and INSTALL)

2)      Create the following directory structure in your working directory


3)      Move your chromats into 'chromat_dir'

4)	Put your reference sequence .phd file into 'phd_dir'
	For more information on creating a "reference sequence" see the notes below.

5)      Change directory into 'edit_dir'

6)      Type 'phredPhrapRef'
        You can also invoke Refcomp directly from the command line.

7)      Run consed.
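
The directory setup can be scripted.  The sketch below creates only the
three directories named in steps 3 to 5 ('chromat_dir', 'phd_dir', and
'edit_dir'); whether Refcomp or the other tools expect further directories
is not stated here, so treat the list as an assumption.  The function name
create_project_layout is hypothetical.

```python
from pathlib import Path
from typing import List

# ASSUMPTION: only the directories mentioned in steps 3-5 are created.
PROJECT_SUBDIRS = ("chromat_dir", "phd_dir", "edit_dir")


def create_project_layout(project_dir: str) -> List[Path]:
    """Create the working sub-directories under project_dir."""
    root = Path(project_dir)
    created = []
    for name in PROJECT_SUBDIRS:
        subdir = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        subdir.mkdir(parents=True, exist_ok=True)
        created.append(subdir)
    return created
```

For example, create_project_layout("/home/gene") would prepare a project
directory matching the path used in the command examples.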


	/home/gene ------|----edit_dir--|--- gene.fasta.ace 

        You can run Refcomp from any directory by typing:

        1) refcomp -d /home/gene -ace gene.fasta.ace  > refcomp_output
           or
           cd /home/gene/edit_dir/
           refcomp -ace gene.fasta.ace  > refcomp_output
        2) If you would like to set the quality threshold explicitly
           (the default is 20), you may use the command:
           refcomp -ace gene.fasta.ace -quality  20 > refcomp_output
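
The same redirection can be done from a script.  This is a minimal sketch
using Python's standard subprocess module; the helper name run_to_file is
hypothetical.  Standard output goes to a file, while standard error is left
untouched so that progress messages still appear on the screen.

```python
import subprocess
from typing import List, Optional


def run_to_file(command: List[str], output_path: str,
                cwd: Optional[str] = None) -> int:
    """Run a command, writing its standard output to output_path."""
    with open(output_path, "w") as out:
        # stderr is not redirected, so it is inherited from this process.
        completed = subprocess.run(command, stdout=out, cwd=cwd)
    return completed.returncode
```

A call such as
run_to_file(["refcomp", "-ace", "gene.fasta.ace"], "refcomp_output",
cwd="/home/gene/edit_dir") would correspond to the second form of command 1
above, provided refcomp is installed and on the search path.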


The program Refcomp relies on the comparison of the consensus sequence 
generated by Phrap and a "reference sequence" which assembles in the same
contig.  To create a reference sequence you need the sequence you are
interested in.  This sequence can then be formatted to look like a .phd
file created by the program Phred.  The reference sequence file is placed in
the phd_dir; after running Phrap it will be included in the assembly and
should be viewable in Consed.  ** Note:  To be recognized as a reference
sequence, the file must contain the character string "ref" (in upper, lower,
or mixed case) in its name.  Examples of other
reference sequences for the mitochondrial genome can be found at
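
The naming rule in the note above can be checked before running the
assembly.  The sketch below is an illustration only: it applies a
case-insensitive substring test to file names, which is an assumption about
how Refcomp performs the match, and both function names are hypothetical.

```python
from pathlib import Path
from typing import List


def is_reference_name(name: str) -> bool:
    """True if the name contains "ref" in upper, lower, or mixed case."""
    return "ref" in name.lower()


def find_reference_files(phd_dir: str) -> List[str]:
    """List the files in phd_dir whose names would count as references."""
    return sorted(
        entry.name
        for entry in Path(phd_dir).iterdir()
        if entry.is_file() and is_reference_name(entry.name)
    )
```

Under this rule an ordinary read whose name happens to contain "ref"
(for example "prefix01.phd.1") would also match, so it is worth confirming
that find_reference_files returns exactly one file.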

Refcomp tags all mismatched sites with the specific "polymorphism" tag found
in Consed.  This enables users to navigate easily and rapidly between sites
and allows for easy verification by the data analyst.  Refcomp has a default
setting of quality 20 for tagging mismatched sites.  This quality setting has
been used in previous studies for finding homoplasmic
polymorphisms in the mitochondrial genome with single pass sequence coverage


(1)     Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A.,
        1994, "Comparative analysis of human DNA variations by fluorescence-
        based sequencing of PCR products", Genomics, 25, 615-622.

(2)	Nickerson, D.A., Tobe, V.O., Taylor, S.L., 1997 "PolyPhred: automating
        the detection and genotyping of single nucleotide substitutions using 
	fluorescence-based resequencing"  Nucleic Acids Res.
	25(14): 2745-2751.

(3)     Ewing, B. and Green, P., 1992, Phred, unpublished.

(4)     Green, P., 1994, Phrap, unpublished.

(5)     Gordon, D., 1995, Consed. unpublished.

(6) 	Rieder, M.J., Taylor, S.L., Tobe, V.O., and Nickerson, D.A.,  
	1998 "Automating the identification of DNA variations using 
	quality-based fluorescence re-sequencing: analysis of the human 
	mitochondrial genome",  Nucleic Acids Res. 26(4):967-973.