All rights reserved.
This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington.
This software is provided ``AS IS'' and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA sequence variation in the human genome. The identification and typing of these variations plays a central role in analyzing the relationships between genome structure and function, and in understanding the allelic variation within and among populations.
Many techniques are used to identify sequence variants among different individuals using DNA amplified by the polymerase chain reaction (PCR). These include denaturing gel electrophoresis, chemical or enzymatic cleavage, heteroduplex analysis, the analysis of single-stranded DNA conformations, variant detector arrays, and direct sequencing of a PCR product. PolyPhred is a program that helps to accurately identify heterozygous sites in sequences produced by sequencing PCR products with fluorescence-based chemistries such as dye-labeled terminators or dye-labeled primers. The program compares sequence traces and searches for homozygotes and heterozygotes.
Detection of heterozygous sequences is based on finding: (1) a significant drop in fluorescence peak height at a variant site when sequence traces obtained from homozygous individuals are compared to traces from heterozygous individuals (theoretical drop is expected to be 50%), and (2) the presence of a second fluorescence peak in sequence traces from heterozygous individuals (see references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2).
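The two trace features described above can be illustrated with a minimal sketch. This is not PolyPhred's implementation: the function name and both cutoff values are hypothetical and were chosen only to make the example concrete.

```python
def looks_heterozygous(het_peak_height, hom_peak_height,
                       secondary_area, primary_area,
                       max_height_ratio=0.75, min_area_ratio=0.25):
    """Return True when both trace features of a heterozygote are present.

    The two cutoffs are hypothetical illustration values, not PolyPhred's.
    """
    if hom_peak_height <= 0 or primary_area <= 0:
        return False
    # Feature 1: peak height drops relative to a homozygote (ideal: about 0.5).
    height_ratio = het_peak_height / hom_peak_height
    # Feature 2: a second peak is present (ideal area ratio: about 1.0).
    area_ratio = secondary_area / primary_area
    return height_ratio <= max_height_ratio and area_ratio >= min_area_ratio
```

A site whose peak is half the homozygous height and which carries a second peak of similar area passes both checks, while a full-height peak with only a trace of a second signal does not.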
PolyPhred identifies substitution SNPs as potential heterozygotes by comparing traces in a sequence assembly. PolyPhred is designed as a member of an integrated suite of sequence analysis applications which includes Phred (references 3,4), Phrap (reference 5), and Consed (reference 6), and is not a stand-alone program.
Phred provides the base-calls, base-call quality information and the peak size information. The information is stored in two types of files called PHD and POLY files. Phrap is used to assemble the input sequences into one or more contigs. The assembly information is stored in a file called the ACE file. PolyPhred uses all three file types to analyze the sequence traces. It begins by reading the ACE file, and uses this file to locate all of the other needed files. When running PolyPhred, the user has the option of either specifying the ACE file explicitly using the -ace flag in the command line (see The Flags), or allowing PolyPhred to locate the ACE file.
PolyPhred identifies SNP sites among the traces and assigns a rank indicating how well the trace at a site matches the expected pattern for a SNP (see How PolyPhred ranks SNP sites). After PolyPhred identifies the putative heterozygous sites, it updates the ACE and PHD files by adding tags that indicate the positions of the sites. The Consed program can then be used to examine the tagged sites. PolyPhred also generates a detailed report listing positions, genotypes and ranks of polymorphic sites in a format that can be easily parsed into a database program.
Parameters and options that govern the operation of PolyPhred can be changed using command-line flags. All of the flags are optional. If no flags are used, PolyPhred uses the ACE file with the highest number at the end of the file name, and all other parameters are set to the defaults.
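The default ACE file selection can be sketched as follows. The sketch assumes that ACE file names end in ".ace." followed by a number; that naming pattern and the function name are assumptions made for illustration, and PolyPhred's own matching rules may differ.

```python
import re

def pick_latest_ace(file_names):
    """Return the name ending in ".ace.<number>" with the largest number, or None.

    The ".ace.<number>" suffix is an assumed naming pattern.
    """
    best_name, best_number = None, -1
    for name in file_names:
        match = re.search(r"\.ace\.(\d+)$", name)
        if match and int(match.group(1)) > best_number:
            best_name, best_number = name, int(match.group(1))
    return best_name
```

The numbers are compared as integers, so a file ending in ".ace.10" is preferred over one ending in ".ace.2".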
Differences with PolyPhred version 3.5 are indicated in red. Note that the -background (-b) and -ratio flags, which were used in previous versions of PolyPhred, are no longer functional in version 4.0. See the section How PolyPhred ranks SNP sites for a discussion on how to alter the way PolyPhred ranks sites.
Many of the flags have abbreviated forms, which are shown in parentheses. For those flags that take an argument, the argument is shown in square brackets ([ ]). Optional arguments are indicated in green.
-ace (-a) [ace file]
Specifies the ACE file. If this flag is omitted, PolyPhred uses the ACE file with the highest final number in the file name.
-blocks [list of block names]
Causes PolyPhred to include or exclude from the output report the specified blocks (i.e. POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, MANUALGENOTYPE, VERIFIED, SAMPLE and COVERAGE). To include a block, immediately precede the block name with a plus (+). To exclude a block, immediately precede the block name with a minus (-). For example, to exclude the SAMPLE and COVERAGE blocks from the output report, add this to the command line:
-blocks -SAMPLE -COVERAGE
Note that if the indel option is set to off, the INDEL block will not appear in the output.
Causes PolyPhred to clear all polyPhred tags from the ACE and PHD files.
-dir (-d) [work directory]
Specifies the directory in which the data is located. This flag allows the user to run PolyPhred from a directory other than the one containing the data to be analyzed. If this flag is not used, PolyPhred should be run from the edit_dir directory of the data set to be analyzed.
-flanking (-f) [number]
Specifies the number of bases flanking a polymorphic site to report in the POLY block of the output.
-group (-g) [regular expression]
Specifies the files to be used in the analysis. PolyPhred analyzes only those sequences with a name that matches the regular expression.
Causes PolyPhred to print the help text. The flags are listed along with their allowed and default values.
-indel (-i) [on / off]
Allows the user to switch on or off the search for indel polymorphisms. If no argument is given, this feature is turned on. The indel search function is currently under development (see Detection of Insertion/Deletion Polymorphisms).
-nav (-n) [file name / on / off]
Causes PolyPhred to write a navigation file containing the polymorphic sites. If the file name is given but does not have a final ".nav" extension, PolyPhred adds one. The file is written to the edit_dir directory of the working directory. If the argument is "on", or no argument is given, a navigation file is written using the default navigator file name.
-output (-o) [output file name / on / off]
Causes PolyPhred to write the results to an output report. If no output file name is given, the default output file name is used. The file is written to the edit_dir directory of the working directory. If the argument is "on", or no argument is given, an output report is written using the default output file name. If this feature is turned off, the results are written to standard output, and can be redirected to a file.
-quality (-q) [number]
Specifies the lower quality limit. This value affects the extent of the excluded regions at the ends of the sample sequences (regions shaded in yellow when viewed in Consed). Reducing this value results in shorter excluded regions, and increasing the value results in longer excluded regions. The quality limit can also affect the ranking of SNPs (see How PolyPhred ranks SNP sites).
-rank (-r) [number]
Specifies the lower limit for the rank assigned to SNP sites. PolyPhred marks only those sites that are assigned a rank from 1 to the specified limit, inclusive. Adjusting the rank limit will affect the stringency of the SNP search (see How PolyPhred ranks SNP sites).
-ref [reference sequence identifier / on / off]
Causes PolyPhred to include the position of polymorphic sites relative to a reference sequence. For each site, the reference sequence position will be printed immediately after the consensus sequence position in the POLY, GENOTYPE, COLUMNGENOTYPE, MANUALGENOTYPE and INDEL blocks of the output. The reference sequence identifier argument is optional. If an identifier is specified, PolyPhred searches for a PHD file name containing this string of characters. When such a file is found, the sequence within this file is set as the reference sequence. If the argument is "on", or no argument is given, this feature is turned on and the default reference sequence identifier is used.
-scale (-s) [number]
Specifies the scale factor that PolyPhred will use to determine how well the trace of a putative SNP matches the ideal. Using this flag will set the scales for both the secondary/primary peak area ratio and the heterozygous/homozygous peak height ratio. These scales can be set independently using the -s1 and -s2 flags (see below). Adjusting the scale values will affect the ranking of SNPs (see How PolyPhred ranks SNP sites).
-scale1 (-s1) [number]
Specifies the scale factor that PolyPhred will use to scale the secondary/primary peak area ratio (see the -s flag above).
-scale2 (-s2) [number]
Specifies the scale factor that PolyPhred will use to scale the heterozygous/homozygous peak height ratio (see the -s flag above).
-tag (-t) [tag type]
Specifies the type of tag that PolyPhred will use to mark SNP sites. The three tag types are genotype, polymorphism, and rank. The tag types can be abbreviated as g, p and r, respectively.
-update [on / off]
Allows the user to control updating the ACE and PHD files. If updating is switched off, only the output report will be written. If no switch is given, updating is turned on.
-verbosity (-v) [number]
Specifies the level of status reporting that PolyPhred will print to the screen as it is running. The allowed values range from 0 (least reporting) to 2 (most reporting).
Prints the PolyPhred version number and build number.
-window (-w) [number]
Specifies the window width. This value, together with the quality value, is used to determine how much of the ends of each sample sequence will be excluded from analysis.
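The flags described above can be combined on a single command line. The following sketch assembles an argument list from a few of them; it does not run anything. The helper name and its parameters are hypothetical, and the executable is assumed to be called "polyphred" and to be on the search path.

```python
def build_polyphred_command(ace_file=None, rank=None, quality=None,
                            group=None, excluded_blocks=()):
    """Assemble an argument list from flags documented above; nothing is executed."""
    args = ["polyphred"]  # assumed executable name
    if ace_file is not None:
        args += ["-ace", ace_file]
    if rank is not None:
        args += ["-rank", str(rank)]
    if quality is not None:
        args += ["-quality", str(quality)]
    if group is not None:
        args += ["-group", group]
    if excluded_blocks:
        # A leading minus excludes a block from the output report.
        args.append("-blocks")
        args += ["-" + name for name in excluded_blocks]
    return args
```

Returning a list rather than a single string keeps each argument separate, so a regular expression passed with -group does not need shell quoting if the list is later handed to a process launcher.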
The traces of a heterozygous site characteristically appear as two overlapping peaks. Ideally, the areas under the peaks are nearly the same, and the heights of the peaks are reduced by about a half of what the height of a homozygous peak would be at the same position. When PolyPhred identifies a putative heterozygous site in a sample sequence, it assigns the site a rank that indicates how well the traces of the two peaks fit the ideal pattern for a SNP. If a site is deemed not to be heterozygous, it is assigned a rank indicating how well it fits the expected trace for a homozygous site. The ranks range from 1, indicating a very good fit, to 6, indicating a very poor fit. (In fact, the vast majority of sites with ranks less than 3 are not polymorphic.)
For each position in the consensus sequence, PolyPhred first examines all sites that are aligned at the position (i.e. the sites that line up in a column when the sequences are viewed in Consed). Each site is assigned a genotype and a rank. Next, PolyPhred counts the number of heterozygous sites with a rank equal to or better than a threshold called the rank limit. If there is at least one such site, PolyPhred marks the column as a polymorphic position and assigns an overall rank to the position. This rank is generally the rank of the best heterozygous site in the column. Consequently, the column rank is never below the rank limit.
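The column-level decision described above can be sketched in a few lines. The data layout (a list of genotype and rank pairs) and the function name are assumptions for illustration. The sketch always takes the best qualifying heterozygous rank as the column rank, whereas the text says only that this is generally the case.

```python
def call_column(sites, rank_limit):
    """Return the column rank, or None if the column is not marked polymorphic.

    sites: list of (genotype, rank) pairs, genotype being "het" or "hom".
    Rank 1 is the best fit and rank 6 the worst.
    """
    qualifying = [rank for genotype, rank in sites
                  if genotype == "het" and rank <= rank_limit]
    if not qualifying:
        return None
    return min(qualifying)  # best (numerically smallest) heterozygous rank
```

With a rank limit of 3, a column containing heterozygous sites of rank 2 and rank 4 is marked polymorphic with rank 2, while a column whose only heterozygous site has rank 5 is not marked.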
In Consed, columns that are marked polymorphic appear blue, with pink markers indicating the heterozygous sites. At the top of each column, the base in the consensus sequence is marked with a color indicating the overall rank.
PolyPhred primarily uses three factors to determine the rank of a heterozygous site. One is the ratio of the areas under the two peaks (called the area ratio). The second is the ratio of the actual height of one of the peaks to the height of a hypothetical homozygous peak (called the normalization ratio). The peak that is used corresponds to the consensus base at the position. The third factor is the average quality, assigned by Phred, of the sites flanking the heterozygous site (the two sites immediately adjacent to the heterozygous site are excluded from the average, as Phred typically reduces their quality due to the heterozygous site itself). After assigning an initial rank based on these three factors, PolyPhred examines other aspects of the trace, such as the presence of a third peak, and adjusts the rank accordingly.
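The third factor, the flanking quality average, can be sketched as below. The number of flanking sites that PolyPhred averages is not stated here, so the `flank` parameter and its default of 5 are hypothetical, as is the function name.

```python
def flanking_quality(qualities, site, flank=5):
    """Average the quality of up to `flank` bases on each side of `site`.

    The two bases immediately adjacent to the site are skipped, as is the
    site itself. The value of `flank` is an assumption for illustration.
    """
    left = qualities[max(0, site - 1 - flank):max(0, site - 1)]
    right = qualities[site + 2:site + 2 + flank]
    values = left + right
    return sum(values) / len(values) if values else 0.0
```

Skipping the adjacent bases means that a local dip in quality caused by the heterozygous site itself does not lower the average.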
There are several flags that the user can use to affect how PolyPhred ranks sites, and thereby increase or decrease the number of positions that are marked polymorphic. These flags and their effect on calling of polymorphic positions are discussed below.
The -rank flag sets the rank limit. Setting the rank limit lower (i.e. toward 6) will result in more positions being marked as polymorphic. However, the lower the rank limit, the more of the calls are likely to be false positives. The number of false positives can be reduced by raising the rank limit (i.e. toward 1). However, this can result in true polymorphic positions being missed.
The parameters that are set with the -quality and -window flags affect the length of the regions at the ends of the sample sequences that are excluded from the search for SNPs. These regions appear yellow in Consed. PolyPhred determines the boundary of these regions by calculating the average base quality (as determined by Phred) within a sliding window. The window slides in from the ends of the sequence and stops when the average quality reaches or exceeds a threshold, or quality limit, which the user can set with the -quality flag. The border of the excluded region is then set at the first base within the window with a quality at least 75% of the quality limit. The size of the sliding window can be set with the -window flag. Increasing the quality limit will result in more of the ends being excluded from the search and, in general, fewer positions being marked polymorphic. Altering the window size results in smaller and less predictable changes in the position of the border. In general, decreasing the window size tends to move the border further inward.
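The sliding-window procedure described above can be sketched as follows. This is an illustration written from the description, not PolyPhred's code: the function names are hypothetical, positions are 0-based list indices, and the right end is handled by applying the same scan to the reversed quality list.

```python
def left_boundary(qualities, quality_limit, window):
    """Return the index of the first base kept for analysis, or None."""
    for start in range(0, len(qualities) - window + 1):
        segment = qualities[start:start + window]
        # The window stops once its average quality reaches the limit.
        if sum(segment) / window >= quality_limit:
            # The border is the first base at or above 75% of the limit.
            for offset, quality in enumerate(segment):
                if quality >= 0.75 * quality_limit:
                    return start + offset
    return None

def search_region(qualities, quality_limit, window):
    """Return (left, right) inclusive indices of the searched region, or None."""
    left = left_boundary(qualities, quality_limit, window)
    if left is None:
        return None
    from_right = left_boundary(qualities[::-1], quality_limit, window)
    return left, len(qualities) - 1 - from_right
```

In this sketch, raising the quality limit moves both borders inward, which matches the behaviour described for the -quality flag.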
Changing the quality limit can also affect how individual sites are ranked. Increasing the quality limit can result in reducing the ranks of some sites (i.e. toward 6), and possibly a reduction in the number of positions marked polymorphic.
The stringency with which PolyPhred compares putative heterozygous sites with the ideal trace can be altered by using the scale flags -scale, -scale1 and -scale2. The -scale1 flag works on the area ratio, and the -scale2 flag works on the normalization ratio. The -scale flag affects both ratios simultaneously.
PolyPhred uses the scale values to adjust the two ratios before comparing them with the expected ratios. Setting either scale value to a number less than 1 will cause PolyPhred to require a tighter match between the actual and ideal traces, and therefore result in fewer SNPs being called. Conversely, setting a scale value to a number greater than 1 will result in more SNPs being called.
The new flag -o allows the user to specify the file name of the output report as a command line option rather than by redirecting the standard output. If the user uses the same file name for all output reports, the -o flag can be put in a .polyphredrc file (see Customizing PolyPhred). If the -o flag is not used, the report will be written to standard output as usual, and can be redirected to a file.
To facilitate parsing of the output report, the report is divided into several blocks. Each block begins with the token BEGIN_BLOCKNAME and ends with END_BLOCKNAME, where BLOCKNAME is the name of the block.
The output report begins with the line BEGIN_MESSAGE and ends with the line END_MESSAGE. The first block within the report is the HEADER block. This block provides the version of PolyPhred that generated the output report, a thumbprint to uniquely identify the output, the date and time the output was generated, and the directory from which PolyPhred was run.
Next is the COMMAND_LINE block. This block lists the user-definable parameters that the user needs to interpret the output report, and to repeat the analysis if needed. This includes the working directory and the ACE file that was used, and those parameters that affect the analysis.
The rest of the report contains results for one or more contigs. The contigs must be valid (i.e. contain more than one valid sample sequence) to appear in the report. The results for each contig are enclosed within the lines BEGIN_CONTIG and END_CONTIG. The line immediately following the BEGIN_CONTIG token provides the name of the contig. The results are then subdivided into several blocks that are described below. The user can specify which blocks actually appear in the output report by using the -blocks flag.
If the -ref flag is used, the position relative to a reference sequence is written in the second field immediately after the consensus sequence position in the POLY, GENOTYPE, COLUMNGENOTYPE, MANUALGENOTYPE, VERIFIED and INDEL blocks.
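Because every block is delimited by BEGIN_ and END_ tokens, a report can be split into its blocks with a small stack-based reader. The sketch below is one possible approach; the function name and the slash-separated path used to label nested blocks are assumptions, and the lines inside each block are returned unparsed.

```python
def parse_report(text):
    """Split a report into (block_path, lines) pairs.

    Nested blocks get a path such as "MESSAGE/CONTIG/POLY". Blocks are
    returned in the order in which they are closed.
    """
    stack = []   # open blocks, each as [name, lines]
    blocks = []  # finished blocks as (path, lines)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_"):
            stack.append([line[len("BEGIN_"):], []])
        elif line.startswith("END_"):
            name, lines = stack.pop()
            if name != line[len("END_"):]:
                raise ValueError("mismatched block: " + line)
            path = "/".join([entry[0] for entry in stack] + [name])
            blocks.append((path, lines))
        elif stack:
            stack[-1][1].append(line)
    return blocks
```

In this layout the contig name appears as the first line collected for each "MESSAGE/CONTIG" entry, since it immediately follows the BEGIN_CONTIG token.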
The POLY block
The positions where SNP sites were identified are listed in this block. Each line reports the consensus sequence position, 5' sequence flanking the polymorphic site, the two major alleles found at the position, 3' sequence flanking the polymorphic site, and the overall rank assigned to the position.
The GENOTYPE block
For each position where a SNP site was identified, the sample sequences that cover the position are listed. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the rank.
The COLUMNGENOTYPE block
For each manually-tagged position on the consensus sequence, the sample sequences that cover the position are listed. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the rank.
This is a new block.
The INDEL block
If the -indel flag is used, a list of the identified insertion/deletion sites will be listed in the INDEL block. Each line reports the consensus sequence position, the position relative to the sample sequence in which the indel was found, the name of the sample sequence, and the size of the indel. A positive size indicates an insertion, and a negative size indicates a deletion.
This is a new block.
The MANUALGENOTYPE block
Sample sequence sites that have been tagged manually are listed in this block. PolyPhred obtains the user-defined tags from the .polyphredrc file (see Customizing PolyPhred). Each line reports the consensus sequence position of a tagged site, the position relative to the sample sequence that was tagged, and the identity of the tag.
This is a new block.
The VERIFIED block
Positions on the consensus sequence that have been manually marked as verified sites are listed in this block. PolyPhred obtains the user-defined tags from the .polyphredrc file (see Customizing PolyPhred). Each line reports the consensus sequence positions and the tag identity.
This is a new block.
The SAMPLE block
The names of the sample sequences that were analyzed and their sequence qualities are listed in this block. Each line reports the name of a sequence, the left and right limits of the region searched by PolyPhred, and the average site quality, as determined by Phred, within the search region. The limits of the search region are calculated using the -quality and -window parameters and are indicated by the yellow regions at the ends of the sample sequences in Consed.
The COVERAGE block
This block provides a tally of the number of sample sequences that PolyPhred examined at each position. Each line reports the begin and end positions of a range relative to the consensus sequence, followed by the number of sample sequences that were analyzed in the range.
This is a new block.
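As an example of consuming one of the blocks above, the sketch below reads COVERAGE lines and lists the ranges with thin coverage. It assumes that the three values on each line are separated by whitespace, which is not stated in this document; the function names are hypothetical.

```python
def parse_coverage(lines):
    """Turn COVERAGE lines into (begin, end, count) tuples.

    Whitespace-separated fields are an assumption about the file layout.
    """
    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip lines that do not carry all three values
        begin, end, count = (int(value) for value in fields[:3])
        ranges.append((begin, end, count))
    return ranges

def low_coverage(ranges, minimum):
    """Return the (begin, end) ranges analyzed in fewer than `minimum` samples."""
    return [(begin, end) for begin, end, count in ranges if count < minimum]
```

A list of thinly covered ranges can help decide where the absence of a reported SNP carries little weight.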
Consed allows the user to create custom tags that can be manually applied to the consensus sequence and sample sequences. Manual tags can be used to mark special sites or regions and provide information about them, or to override calls made by Phred or PolyPhred. Consed stores these tags in the ACE and PHD files.
PolyPhred is able to capture three types of manual tags and provide information about them in the output report. To activate the tag-capturing feature, it is necessary to create a .polyphredrc file and indicate in the file the tag types to be captured (see Customizing PolyPhred).
The manualtag type
This tag type is used to mark sites in sample sequences. Typically this is done to modify or override the genotype call made by Phred or PolyPhred. The captured tags are listed in the MANUALGENOTYPE block.
The verifiedtag type
This tag type is applied to the consensus sequence to indicate positions verified as polymorphic by an analyst. The captured tags are listed in the VERIFIED block.
The columntag type
This tag type is applied to positions on the consensus sequence and is used to force PolyPhred to genotype the column of sites at those positions. This is done in addition to the normal search and genotyping function that PolyPhred performs. Sites that are genotyped under one of these manual tags are listed in the COLUMNGENOTYPE block.
Searching for insertion/deletion (indel) polymorphisms is a new optional feature in this version of PolyPhred. When PolyPhred locates a putative indel site within a sample sequence, it excludes the sequence downstream of the indel site from the search for SNP sites.
The indel search algorithm is still under development, and there is room for improvement in future versions. Currently, PolyPhred identifies only those sample sequences that appear to be heterozygous for an indel. Sequences that are homozygous for an indel relative to the consensus sequence are not marked. Also, PolyPhred tends to under-report the presence of indels. For these reasons, the indel search option is set to off by default, and can be activated by using '-i on' in the command line. However, users who wish to change the default setting to 'on' can do so in the .polyphredrc file.
When PolyPhred identifies an indel site, it inserts an 'indelSite' tag in the ACE file, and an 'indel' tag in the PHD file of the sample sequence containing the indel. If the indel is within a microsatellite, the indelSite tag is written with the comment 'microsat'.
Consed currently does not have a built-in function for interpreting the new indel and indelSite tags. To allow Consed to interpret these new tags, it is necessary to modify the .consedrc file. If this is not done, Consed will report an error. Add the following lines to the .consedrc file:
consed.customConsensusTag1: indelSite
consed.tagColorCustomConsensusTag1: DarkCyan
consed.customTag1: indel
consed.tagColorCustomTag1: DarkOrange
If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.
phred version 0.961028 or later
phrap version 0.960731 or later
phd2fasta version 0.971024 or later
phredPhrap
consed version 8.0 or later
mktrace
polyphred
EXAMPLE.doc
polyphred.html (this document)
example/chromat_dir/ (this directory should contain 10 files.)
example/edit_dir/ (this directory should contain 15 files.)
example/phd_dir/ (this directory should contain 10 files.)
example/poly_dir/ (this directory should contain 10 files.)
# $polyPhredExe = "/usr/local/genome/bin/polyphred";
$bUsingPolyPhred = 0;
Phred, Phrap, Consed and PolyPhred all require a fixed directory structure for analyzing sequence data. If a gene to be analyzed is called "mygene", for example, the directory structure should look like this:
mygene/
containing the subdirectories:
chromat_dir/
edit_dir/
phd_dir/
poly_dir/
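The directory layout above can be created with a short script. This is a convenience sketch rather than part of the distribution; the function name is hypothetical.

```python
import os

# The four subdirectories named in the layout above.
SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")

def make_project_dirs(root, gene):
    """Create root/gene with the four required subdirectories; return its path."""
    base = os.path.join(root, gene)
    for sub in SUBDIRS:
        # exist_ok lets the script be re-run on an existing project.
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    return base
```

Calling make_project_dirs(".", "mygene") creates the "mygene" directory and its subdirectories under the current directory.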
PolyPhred can be customized to suit the preferences of the user by creating a .polyphredrc file. The .polyphredrc file allows the user to change default parameter values, as well as specify any manual tags that PolyPhred should capture and write to the output report. This file is optional, and if it is not present, PolyPhred will use its built-in default parameter values and will not capture manual tags.
When PolyPhred starts, it looks for a .polyphredrc file in three locations. It first looks in the user's current directory. If the file is not found there, PolyPhred looks in the user's home directory. If the file is still not found, PolyPhred looks for a directory in the user's shell rc file. The directory is specified by including in the shell rc file the line:
setenv POLYPHRED_PATH [path]
where [path] is the directory containing the .polyphredrc file.
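The three-step lookup described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the setenv line makes POLYPHRED_PATH available as an ordinary environment variable.

```python
import os

def find_polyphredrc(cwd, home, environ):
    """Return the first .polyphredrc found, or None.

    Search order: current directory, home directory, then the directory
    named by POLYPHRED_PATH (assumed to be an environment variable).
    """
    candidates = [cwd, home]
    if environ.get("POLYPHRED_PATH"):
        candidates.append(environ["POLYPHRED_PATH"])
    for directory in candidates:
        path = os.path.join(directory, ".polyphredrc")
        if os.path.isfile(path):
            return path
    return None
```

A typical call would be find_polyphredrc(os.getcwd(), os.path.expanduser("~"), os.environ), which is handy for checking which file a given run would pick up.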
The default values of any parameter can be changed in the .polyphredrc file. To change a default value, enter the key word "flag", the command-line flag for the parameter and the new default value. Each entry must be on a separate line. For example, to change the default value for the rank limit to 2 and the quality limit to 25, enter these two lines in the .polyphredrc file:
flag -r 2
flag -q 25
If the "flag" key is used to change the default file names for the -nav and -output flags, or the default reference identifier for the -ref flag, these features will also be turned on by default. To make these changes while keeping the features off by default, a different set of key words must be used:
outputfile [file name]
navfile [file name]
refID [identifier]
The first line changes the default file name of the output report, the second line changes the default file name of the navigation file, and the third line changes the default reference sequence identifier.
To specify the manual tags that are applied to sample sequences, list each tag on a separate line, preceded by the key word "manualtag". Sites marked with these tags will be listed in the MANUALGENOTYPE block of the output report.
To specify the manual tags that mark verified positions in the consensus sequence, list each tag on a separate line, preceded by the key word "verifiedtag". Positions marked with these tags will be listed in the VERIFIED block.
To specify the manual tags that mark columns for forced genotyping, list each tag on a separate line, preceded by the key word "columntag". Genotype information for each sample sequence covering the tagged positions will be listed in the COLUMNGENOTYPE block.
The date that appears at the top of the output report can be changed from the "day/month/year" format, which is the default, to the "month/day/year" format. To do this, put this line in the .polyphredrc file:
date MDY
Blank lines are permitted. In addition, a line that begins with the character # is treated as a comment.
An example of a .polyphredrc file:
date MDY
flag -q 25
flag -f 16
outputfile report.txt
refID .refSeq
# Manual Tags
verifiedtag polymorphism
columntag manualGenotype
manualtag heterozygote
manualtag homozygote
manualtag indel
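A file in this format is simple to read from other tools, for example to record the settings alongside a report. The sketch below handles only the key words described in this section; the function name and the layout of the returned dictionary are assumptions, and unknown key words are silently ignored.

```python
def parse_polyphredrc(text):
    """Parse .polyphredrc text into a dictionary of settings."""
    config = {"flags": {}, "manualtag": [], "verifiedtag": [], "columntag": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are permitted
        fields = line.split()
        key, rest = fields[0], fields[1:]
        if key == "flag" and rest:
            # e.g. "flag -q 25" stores "25" under "-q"
            config["flags"][rest[0]] = " ".join(rest[1:])
        elif key in ("manualtag", "verifiedtag", "columntag") and rest:
            config[key].append(rest[0])
        elif key in ("outputfile", "navfile", "refID", "date") and rest:
            config[key] = rest[0]
    return config
```

Applied to the example file above, the "flags" entry holds the two changed defaults and the three tag lists hold the manual tag names.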
If you have questions or problems with Phred, Phrap or Consed, or you need to obtain these programs, please see the web site at:
If you have questions or problems with PolyPhred, please
Follow the "PolyPhred" link for the email address of the person to contact. Please do not email questions to the web master.
1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A., 1994, "Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products", Genomics 25: 615-622.
2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "Polyphred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing", Nucleic Acids Research 25: 2745-2751.
3. Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling of automated sequencer traces using phred. I. Accuracy assessment", Genome Research 8: 175-185.
4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using phred. II. Error probabilities", Genome Research 8: 186-194.
5. Green, P., 1994, Phrap, unpublished. http://www.genome.washington.edu/UWGC/analysistools/phrap.htm
6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence finishing", Genome Research 8: 195-202.