POLYPHRED-3.5 DOCUMENTATION

/******************************************************************************
*                                                                             *
*  Program: PolyPhred                                                         *
*  Version: 3.5                                                               *
*  Copyright (C) 1996-2000                                                    *
*  by Deborah A. Nickerson, Scott Taylor, Natali Kolker                       *
*  University of Washington                                                   *
*                                                                             *
*  All rights reserved.                                                       *
*                                                                             *
*  This software is part of a test version of the PolyPhred                   *
*  distribution package.  It may not be redistributed, distributed            *
*  in modified form, or used for any commercial purpose, including            *
*  commercially funded sequencing, without written permission from            *
*  the authors and the University of Washington.                              *
*                                                                             *
*  This software is provided ``AS IS'' and any express or implied             *
*  warranties, including, but not limited to, the implied warranties of       *
*  merchantability and fitness for a particular purpose are disclaimed.       *
*  In particular, this disclaimer applies to any diagnostic purpose. In no    *
*  event shall the authors or the University of Washington be liable for      *
*  any direct, indirect, incidental, special, exemplary, or consequential     *
*  damages (including, but not limited to, procurement of substitute goods    *
*  or services; loss of use, data, or profits; or business interruption)      *
*  however caused and on any theory of liability, whether in contract,        *
*  strict liability, or tort (including negligence or otherwise) arising      *
*  in any way out of the use of this software, even if advised of the         *
*  possibility of such damage.                                                *
*                                                                             *
******************************************************************************/

POLYPHRED
---------

PolyPhred 3.5 

Please make sure to read the README file included in this distribution.


INTRODUCTION
------------

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA
sequence variation in the human genome.  The identification and typing of
these variations play a central role in analyzing the relationships between
genome structure and function, and in understanding the allelic variation
within and among populations.  In addition, the typing of SNPs plays a
role in identifying mutated oncogenes and genetic and infectious diseases, in
matching tissues prior to transplantation, and in forensic and paternity
testing.

Many techniques are used to identify sequence variants among different
individuals using DNA amplified by the polymerase chain reaction (PCR).  These
include denaturing gel electrophoresis, chemical or enzymatic cleavage, 
heteroduplex analysis, the analysis of single-stranded DNA conformations, and
direct sequencing of a PCR product. PolyPhred is a program that helps to 
accurately identify heterozygous sites in sequences produced with fluorescence-
based chemistries. The program compares sequence traces and searches for 
homozygotes and heterozygotes.  To detect heterozygotes, we have noted two 
consistent changes: 1) a significant drop in fluorescence peak height at a 
variant site when sequence traces obtained from homozygous individuals are 
compared to traces from heterozygous individuals, and 2) the presence of a 
second fluorescence peak in sequence traces from heterozygous individuals (1, 
2). PolyPhred scans for these two features when sequence traces are being 
compared (2). 


HOW POLYPHRED WORKS
-------------------

PolyPhred identifies putative heterozygotes for substitution SNPs by comparing
traces in a sequence assembly.  PolyPhred is designed as a member of an
integrated suite of sequence analysis applications which includes Phred (3,4),
Phrap (5), and Consed (6), and is not a stand-alone program.  Phred provides the
base-calls, base-call quality information and the peak size information.  Phrap
is used to assemble the input sequences into one or more contigs.  Using the
information provided by Phred and Phrap, PolyPhred identifies substitution-type
polymorphisms among the traces.  PolyPhred tags the putative heterozygotes and
the Consed program is used to visually review the PolyPhred output.  The
PolyPhred output is also written to a file in a format that can be easily parsed
into a database program.


WHAT'S NEW IN POLYPHRED
-----------------------

1.      PolyPhred now ranks the columns and sites that are potentially
        polymorphic.  The ranks are ordered from 1 (highest) to 6 (lowest),
        and the default (recommended) ranks are 1-3.

2.      A new set of Consed tags, "polyphredRank", is introduced, which
        allows PolyPhred to mark sites of different ranks with different
        colors.  The tags are applied to the consensus sequence and are
        defined in the .consedrc file (see the INSTALL file).

3.      With a new version of Consed and a .consedrc file, one can now
        visualize the column ranks applied to the consensus sequence
        (see the INSTALL file).

4.      PolyPhred's output file has been modified to include a new output
        parameter in the "POLY" section (the rank of the polymorphic
        column) and in the "GENOTYPE" section (the rank of each genotype
        in the column).


WHAT'S NEW IN POLYPHRED-CONSED INTEGRATION 
------------------------------------------

1.	It is now possible to bring up ALL traces at once in a scrolling window
        for a particular location.  See the Consed documentation for more
	information.

2.	Reads can be put in alphabetical order.  See the Consed documentation 
        for more information.


COMMAND LINE
------------

1.      -ace or -a [name of ace file]

        The user must supply the name of the ace file containing the
        assembly information for the assembly to be scanned. 
	Required.

2.      -d [directory name]

        The user can supply the name of the project directory, i.e. the
        directory that contains edit_dir, phd_dir, etc. (see EXAMPLES).
        Default "../"
	Optional.

3.      -quality or -q [number between 0 and 50]

        The Phred quality value used to determine search limits and
        to limit the base mismatches reported in the output. 
        Default 30.
        Optional.

4.	-rank [number between 1-6]

        The rank option defines which ranks are included in the output.
        E.g., with rank 2 the user gets the putative polymorphic columns
        ranked 1-2.
	Default 3.
	Optional. 
	
5.      -ratio or -r [number between 0.0 and 1.0]

        The ratio used to identify heterozygotes.  
	Default 0.65.
	Optional.

6.      -background or -b [number between 0.0 and 1.0]

        The ratio used to reduce background.  
	Default 0.25.
	Optional.

7.	-tag or -t [polymorphism or p | genotype or g | rank or r]
	
        Three tagging options are available in PolyPhred.  The
        'polymorphism' (or 'p') option marks putative heterozygotes with
        the 'polymorphism' tag.  The 'genotype' (or 'g') option, for users
        of Consed 8.0 or later, marks all sequences with tags indicating
        the PolyPhred-determined genotype.  The 'rank' (or 'r') option
        applies rank tags; with a new version of Consed, one can then
        visualize the column ranks applied to the consensus sequence.
	Default 'g'.
	Optional.

8.	-group or -g [regular expression]

        Since PolyPhred makes comparisons among sequence traces, it is
        important to separate your traces into groups, especially if you are
        using different sequencing chemistries.  You should not compare
        sequences generated with different chemistries (apples and
        oranges).  PolyPhred can make this separation for you if your
        chromatograms are named using a nomenclature that indicates
        sequencing chemistry.  The user can specify which traces are to
        be analyzed by PolyPhred with the appropriate regular expression.
        PolyPhred uses limited regular expressions like those
        described on the regexp(5) manual page to match the patterns.
        Default ".+" (PolyPhred scans all traces).
	Optional.
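
The options above can be assembled programmatically.  The following Python
sketch is not part of PolyPhred; it only builds an argument list from the
flags, value ranges and tag names documented in this section, and rejects
values outside those ranges before the program is ever started.  The function
name and its keyword arguments are hypothetical, and the executable name
'polyphred' is taken from the usage instructions in this document.

```python
import shlex

# Documented numeric options: flag -> (lowest, highest) allowed value.
NUMERIC_RANGES = {
    "-quality": (0, 50),
    "-rank": (1, 6),
    "-ratio": (0.0, 1.0),
    "-background": (0.0, 1.0),
}
# Documented values for -tag (long and short forms).
TAG_CHOICES = {"polymorphism", "p", "genotype", "g", "rank", "r"}


def build_polyphred_command(ace, directory=None, quality=None, rank=None,
                            ratio=None, background=None, tag=None, group=None):
    """Return an argv list for polyphred; omitted options keep their defaults."""
    if not ace:
        raise ValueError("the ace file name is required")
    argv = ["polyphred"]
    if directory is not None:
        argv += ["-d", directory]
    argv += ["-ace", ace]
    numeric = {"-quality": quality, "-rank": rank,
               "-ratio": ratio, "-background": background}
    for flag, value in numeric.items():
        if value is None:
            continue  # not supplied: PolyPhred applies its own default
        low, high = NUMERIC_RANGES[flag]
        if not low <= value <= high:
            raise ValueError(f"{flag} must be between {low} and {high}")
        argv += [flag, str(value)]
    if tag is not None:
        if tag not in TAG_CHOICES:
            raise ValueError(f"unknown -tag value: {tag!r}")
        argv += ["-tag", tag]
    if group is not None:
        argv += ["-group", group]
    return argv


if __name__ == "__main__":
    argv = build_polyphred_command("gene.fasta.ace", directory="/home/gene",
                                   rank=2, group=".[xy].")
    # shlex.join quotes the regular expression so a shell will not expand it.
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps the regular expression
given to -group intact when the list is handed to a process launcher, since no
shell is involved in interpreting it.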


OUTPUT
------

PolyPhred writes output both to the standard output port and the standard error
port.  The standard output contains information about the analysis such as 
the location of a putative polymorphism and the putative genotypes for each 
sample.  The standard error contains messages about PolyPhred's progress.
Generally the standard output is redirected to a file and the standard
error is allowed to be printed on the screen.  In both outputs, information
is printed between beginning and ending tokens to make parsing of the file
easy.


STANDARD OUT:
------------- 

1.      The first line prints out header information which includes
        the version number, a unique integer for identifying
        the PolyPhred output (the same integer is used to
        uniquely identify the standard error), and the time
        of the execution.  

2.      The next line prints out the command line used for the
        run.  If optional parameters were not supplied on
        the command line then the defaults are printed.  The
        associated token is 'COMMAND_LINE'.

3.      Then, for each contig, PolyPhred prints the contig name and
        length followed by sample information, polymorphism
        information, and genotype information.  The associated
        token is 'CONTIG'.

4.      For each sample, PolyPhred prints the name, the range of
        positions searched in unpadded read coordinates, and the 
        average Phred quality of the base-calls in the region searched.  
        The associated token is 'SAMPLE'.

5.      For each putative polymorphism, PolyPhred prints the
        position in unpadded contig coordinates, the five prime
        sequence context, the two identified alleles, the
        three prime sequence context and the column rank. 
        The associated token is 'POLY'. 

6.      For each putative polymorphism and for each sample
        PolyPhred prints a genotype line.  Each genotype line
        reports the position in unpadded contig coordinates, the
        position in unpadded read coordinates, the name of the
	sample, the two alleles which make up the genotype and the
        genotype rank. The associated token is 'GENOTYPE'.
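
Because each block of output is bracketed by tokens, the file can be read with
a small parser.  The Python sketch below is an illustration only and rests on
two assumptions that this document does not confirm: that the beginning and
ending tokens appear on their own lines as BEGIN_<TOKEN> and END_<TOKEN>, and
that the fields of a genotype line are separated by whitespace in the order
listed in item 6.  The sample text is invented for the example and is not real
PolyPhred output.

```python
from collections import defaultdict

# Invented sample text; the BEGIN_/END_ delimiter syntax is an assumption.
HYPOTHETICAL_OUTPUT = """\
BEGIN_GENOTYPE
1204 310 sampleA.x A G 1
1204 295 sampleB.x A A 1
END_GENOTYPE
"""


def split_sections(lines):
    """Group content lines under the innermost enclosing token."""
    sections = defaultdict(list)
    open_tokens = []  # stack of tokens whose END_ line has not been seen yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_"):
            open_tokens.append(line[len("BEGIN_"):])
        elif line.startswith("END_"):
            if open_tokens:
                open_tokens.pop()
        elif open_tokens:
            sections[open_tokens[-1]].append(line)
    return sections


def parse_genotypes(lines):
    """Turn GENOTYPE lines into dictionaries, skipping malformed lines."""
    records = []
    for line in split_sections(lines)["GENOTYPE"]:
        fields = line.split()
        if len(fields) != 6:
            continue  # does not match the six fields described in item 6
        contig_pos, read_pos, sample, allele1, allele2, rank = fields
        records.append({
            "contig_pos": int(contig_pos),  # unpadded contig coordinate
            "read_pos": int(read_pos),      # unpadded read coordinate
            "sample": sample,
            "alleles": (allele1, allele2),
            "rank": int(rank),
        })
    return records


if __name__ == "__main__":
    for record in parse_genotypes(HYPOTHETICAL_OUTPUT.splitlines()):
        print(record)
```

If the real delimiters differ, only the two startswith() checks in
split_sections need to change; the field handling follows the description in
this section.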


STANDARD ERROR
--------------

1.      The first line prints out header information which includes
        the version number, a unique integer for identifying
        the PolyPhred output (the same integer is used to
        uniquely identify the standard error), and the time
        of the execution.  The token used to separate these
        lines from the remainder of the output is 'HEADER'.

2.      The next line prints out the command line used for the
        run.  If optional parameters were not supplied on
        the command line then the defaults are printed.  The
        associated token is 'COMMAND_LINE'.

3.      Then, for each contig, it prints various messages which
        report on the progress of the run.  The associated 
        token is 'MESSAGE'.


USING YOUR OWN DATA
-------------------

1.      Install PolyPhred (see README and INSTALL)

2.      Create the following directory structure in your working directory

                ./chromat_dir
                ./edit_dir
                ./phd_dir
                ./poly_dir

3.      Move your chromats into 'chromat_dir'

4.      Change directory into 'edit_dir'

5.      Type 'phredPhrap [basename]'

6.      Run polyphred by typing 'polyphred -ace [ace file]' and redirect the
        standard output to an appropriate file for viewing later.

7.      Run consed.
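
Step 2 above can be scripted.  The following Python sketch creates the four
directories named in that step under a project directory; the function name
is hypothetical, and the sketch does nothing beyond creating the directories
(moving chromatograms and running phredPhrap remain manual steps).

```python
from pathlib import Path

# The four subdirectories listed in step 2.
PROJECT_SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")


def make_project_dirs(project_root):
    """Create the working directory layout and return the created paths."""
    root = Path(project_root)
    created = []
    for name in PROJECT_SUBDIRS:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for path in make_project_dirs("."):
        print(path)
```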


EXAMPLES:
--------- 
       If you have already run the phredPhrap or phredPhrapPoly script,
       your data are laid out as follows (depending on which version
       of phredPhrap was run, there will be either a .fasta.ace or
       a .fasta.screen.ace file):
       	   Example:
	     /home/gene ------|----edit_dir--|--- gene.fasta.screen.ace 
                              |----poly_dir
                              |----chromat_dir
                              |----phd_dir 

             Files in phd_dir : 
             gene.r.phd.1,  gene.s.phd.2 -primer
             gene.x.phd.1,  gene.y.phd.2 -terminator

         Now you can run PolyPhred with different parameters.
 
1. Option: '-d'.
	
        You can run PolyPhred from any directory by typing:
        polyphred -d /home/gene -ace gene.fasta.ace > PolyPhred.out
        or
        cd /home/gene/edit_dir
        polyphred -ace gene.fasta.ace > poly.out
        
2. Option: '-group'.
        
        If you want to run PolyPhred only on the gene.x.phd.1 and
        gene.y.phd.2 reads, type:
        polyphred -d /home/gene -ace gene.fasta.ace -group '.[xy].' > p.out
        or
        cd /home/gene/edit_dir
        polyphred -ace gene.fasta.ace -group '.[xy].' > poly.out
	
3. Option '-rank':
	
        By default polyphred runs with '-rank 3', which reports
        all columns with ranks 1-3.

        If you want ranks 1-2, type:
        polyphred -d /home/gene -ace gene.fasta.ace -rank 2 > poly.out

        or, if you want ranks 1-6, type:
        polyphred -d /home/gene -ace gene.fasta.ace -rank 6 > poly.out
       
4. Option '-tag':
	 
        To apply 'genotype' tags explicitly, type either of the following
        equivalent commands:
	
	polyphred -d /home/gene -ace gene.fasta.ace -tag g > poly.out
	polyphred -d /home/gene -ace gene.fasta.ace -tag genotype > poly.out
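
To preview which reads a -group expression would select in example 2, the
pattern can be tried against the read names beforehand.  The Python sketch
below is only an approximation: it uses Python's re module rather than the
regexp(5) routines PolyPhred relies on, and it assumes that the pattern may
match anywhere in the name.  For a simple pattern such as '.[xy].' the two
syntaxes agree, but more elaborate expressions may behave differently.

```python
import re


def select_reads(read_names, group_pattern=".+"):
    """Return the names in which the pattern matches somewhere."""
    regex = re.compile(group_pattern)
    return [name for name in read_names if regex.search(name)]


if __name__ == "__main__":
    # File names from the phd_dir listing in the example above.
    reads = ["gene.r.phd.1", "gene.s.phd.2", "gene.x.phd.1", "gene.y.phd.2"]
    print(select_reads(reads, ".[xy]."))  # terminator reads only
    print(select_reads(reads))            # default ".+" keeps every read
```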
        
	

NOTES ON PERFORMANCE
--------------------

1.      New energy transfer terminators substantially improve PolyPhred's 
        accuracy in calling heterozygotes using primer or terminator    
        chemistries.

2. 	In general, the error rate (i.e. false negative rate) decreases with
        increasing quality and increasing rank parameter.

3.      In general, the false positive rate decreases with increasing quality
        parameter and decreasing rank parameter.

4.      In general, the amount of sequence scanned by PolyPhred increases
        with decreasing quality parameter.

5.      PolyPhred does not search for insertions/deletions; however, these
        types of polymorphisms might be found in two ways.  First, phrap
        will sometimes separate heterozygotes and homozygotes into separate
        contigs.  Second, if heterozygotes and homozygotes are assembled in
        the same contig, the heterozygotes will typically appear as several
        consecutive or nearly consecutive false positives, followed by the
        end of the region scanned by PolyPhred, because the overlapping
        peaks produced by the insertion/deletion polymorphism lower the
        quality.


NOTES ON ANALYSIS
-----------------

        We have found for large projects that generating an annotated 
	reference sequence containing known sequence features can greatly 
	enhance the analysis of sequence variations (7).


NOTES ON CUSTOMIZATION
----------------------

        You must use the '.consedrc' file to visualize the new PolyPhred
        rank tags with Consed.  For more information, consult INSTALL and
        the Consed documentation.


REFERENCES
----------

(1)     Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A.,
        1994, "Comparative analysis of human DNA variations by fluorescence-
        based sequencing of PCR products", Genomics 25, 615-622.

(2)     Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "Polyphred: 
        automating the detection and genotyping of single nucleotide 
        substitutions using fluorescence-based resequencing", Nucleic Acids 
        Research, 25: 2745-2751.

(3)     Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling
        of automated sequencer traces using phred.  I. Accuracy assessment",
        Genome Research 8: 175-185.

(4)     Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer
        traces using phred.  II. Error probabilities", Genome Research 8:
        186-194.

(5)     Green, P., 1994, Phrap, unpublished.
        http://www.genome.washington.edu/

(6)     Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical
        tool for sequence finishing", Genome Research 8: 195-202.

(7)	Rieder, M.J., Tobe, V.T., Taylor, S.L., and Nickerson, D.A., 1998,
	"Automating the identification of DNA variations using quality-based
	fluorescent resequencing: Analysis of the human mitochondrial genome",
	Nucleic Acids Res. 26: 967-973.