****************************
OUTPUT FROM PHRED -HELP
****************************
parameter     argument        default     description
----------------------------------------------------------------------------
-if           <filename>      none        read input filenames from file
-id           <dirname>       none        read input files from <dirname>
-zd           <dirname>       path        uncompress program path
-zt           <dirname>       /usr/tmp    uncompress temporary directory
-st           <type>          fasta       sequence file type (fasta|xbap)
-s            none            nofile      write *.seq sequence file(s)
-s            <filename>      nofile      write <filename> sequence file
-sa           <filename>      none        append sequence files to <filename>
-sd           <dirname>       nofile      write *.seq file(s) to <dirname>
-qt           <type>          fasta       quality file type (fasta|xbap|mix)
-q            none            nofile      write *.qual quality file(s)
-q            <filename>      nofile      write <filename> quality file
-qa           <filename>      none        append quality files to <filename>
-qd           <dirname>       nofile      write *.qual file(s) to <dirname>
-qr           <filename>      nofile      write quality report to <filename>
-p            none            nofile      write *.phd.1 file(s)
-p            <filename>      nofile      write <filename> phd file
-pd           <dirname>       nofile      write *.phd.1 file(s) to <dirname>
-cv           <version>       2           SCF format version (2 or 3)
-cp           <precision>     maxval      SCF data precision in bytes (1 or 2)
-cs           none            no scale    always scale traces in SCF files
-c            none            nofile      write * phred SCF file(s)
-c            <filename>      nofile      write <filename> phred SCF file
-cd           <dirname>       nofile      write * SCF file(s) to <dirname>
-d            none            nofile      write *.poly poly file(s)
-d            <filename>      nofile      write <filename> poly file
-dd           <dirname>       nofile      write *.poly file(s) to <dirname>
-raw          <seq name>      NULL        seq name written in output files
-log          none            nolog       write phred.log file
-nocall       none            call        disable basecalling
-trim         <enzyme seq>    notrim      enable auto trim
-trim_alt     <enzyme seq>    notrim      enable alternate auto trim
-trim_cutoff  <n>             0.05        trim_alt error probability
-nonorm       none            normalize   disable trace normalization
-nosplit      none            none        no compressed peak splitting
-nocmpqv      none            none        no compressed peak quality values
-ceilqv       <ceiling qv>    none        quality value ceiling value
-beg_pred     <point>         none        set peak prediction start point
-v            <n>             none        verbose operation = 1 to 63
-tags         none            no tags     label common messages with tags
-V            none            none        show version
-help         none            none        help
-h            none            none        help
-doc          none            none        show phred documentation

For the warning messages `unable to identify chemistry and dye' and `unknown chemistry (...) in chromat ...' please read the phred documentation using the command `phred -doc'.

OUTPUT FROM PHRED -DOC

PHRED Documentation
-------------------

1. Introduction.

Phred reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files. Phred can read trace data from SCF files, ABI model 373 and 377 DNA sequencer chromatogram files, and MegaBACE ESD chromatogram files, automatically detecting the file format and whether the chromat file was compressed by gzip or UNIX compress. After calling bases, phred writes the sequences to files in FASTA format, the format suitable for XBAP, PHD format, or SCF format. Quality values for the bases are written to FASTA format files or PHD files, which can be used by the phrap sequence assembly program in order to increase the accuracy of the assembled sequence.

Significant differences in this release

New

- new phredpar.dat file format identifies the sequencing machine type and recognizes `unknown' dye and chemistry values.
- has four and five parameter quality value lookup tables calibrated with MegaBACE data.
- reads MegaBACE processed ESD files. See the note entitled `ESD Files'.
- phred writes trim information in phd files.
- a `-tags' option causes phred to `tag' standard output and error in order to facilitate parsing.
- a `-v' command line option for diagnostic output.
- a `-trim_cutoff' option sets the baseline error probability for the calculations used with `-trim_alt' and the values stored in the phd file.
- a `-doc' option prints this document to the screen.
- a `-cs' option always scales traces written to SCF files.

Modified

- `-trim_alt' no longer tests the average quality value of the trimmed sequence.
- sets the quality value of `N's to zero.
- when reading traces from MegaBACE ESD files and writing SCF files, phred now always scales the trace for the SCF file so the maximum trace value equals the maximum value that can be stored in the SCF file given the requested (or default) storage precision.
  That is, if the precision is one, the maximum trace value in the SCF file will be 255, and if the precision is two, the maximum trace value will be 65535.
- when writing SCF version 3.0 files with trace precision = 1 (-cv 3 -cp 1), phred used to always scale the trace to a maximum value of 255. It now scales the trace when the maximum trace value exceeds 255. This is the same behavior as for writing SCF version 2.0 files. You can use the `-cs' option in order to make it always scale the traces.

Please note

- the quality values that the `old' phred assigned to MegaBACE bases are too high. This version of phred, which has additional quality value lookup tables calibrated specifically for MegaBACE data, assigns more accurate quality values --- and they are generally lower.
- the `MegaBACE Mobility File' entry in the phredpar.dat file now specifies `unknown' chemistry, rather than `primer' or `terminator' because some early MegaBACE software wrote `MegaBACE Mobility File' for the `primer ID' string in both primer and terminator chemistry ABD files. You may want to change this value if you process exclusively primer or terminator chemistry MegaBACE data; however, you must remember to change it if you decide to process different chemistry data from the MegaBACE later.

2. Acknowledgements.

Phred benefits from ideas developed by LaDeana Hillier, Mike Wendl, Dave Ficenec, Tim Gleeson, Alan Blanchard, and Richard Mott.

3. Algorithms.

Phred uses simple Fourier methods to examine the four base traces in the region surrounding each point in the data set in order to predict a series of evenly spaced predicted locations. That is, it determines where the peaks would be centered if there were no compressions, dropouts, or other factors shifting the peaks from their "true" locations.

Next phred examines each trace to find the centers of the actual, or observed, peaks and the areas of these peaks relative to their neighbors.
The peaks are detected independently along each of the four traces so many peaks overlap. A dynamic programming algorithm is used to match the observed peaks detected in the second step with the predicted peak locations found in the first step.

Phred evaluates the trace surrounding each called base using four or five quality value parameters to quantify the trace quality. It uses a quality value lookup table to assign the corresponding quality value. The quality value is related to the base call error probability by the formula

   QV = - 10 * log_10( P_e )

where P_e is the probability that the base call is an error.

Phred uses data from a chemistry parameter file called 'phredpar.dat' in order to identify dye primer data. For dye primer data, phred identifies loop/stem sequence motifs that tend to result in CC and GG merged peak compressions. It reduces the quality values of potential merged peaks and splits those peaks that have certain trace characteristics indicative of merged CC and GG peaks. In addition, the chemistry and dye information are passed to phrap.

4. Building and installing.

The INSTALL file describes the steps for building and installing phred.

Copy the phred parameter file, called 'phredpar.dat', to a directory that is accessible by phred users and set the environment variable 'PHRED_PARAMETER_FILE' to the full path name of the file. For example, if you copy 'phredpar.dat' to '/usr/local/etc/PhredPar' and you are using the C shell then issue the command

   % setenv PHRED_PARAMETER_FILE /usr/local/etc/PhredPar/phredpar.dat

It is most convenient to set the environment variable in the system-wide shell startup (cshrc or equivalent) file. You can rename the phred parameter file but the PHRED_PARAMETER_FILE environment variable must reflect the new name. With Windows NT you give the command

   % set PHRED_PARAMETER_FILE=\usr\local\etc\PhredPar\phredpar.dat

in the DOS command window in which you will run phred.
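
Before a large run it can help to confirm that the environment variable is set and points at a file. The following is a minimal Python sketch, not part of the phred distribution: the function name `check_phred_parameter_file` is hypothetical, and the check only confirms that the variable is set and names a readable file. It does not validate the file contents, which phred does itself.

```python
import os


def check_phred_parameter_file(environ=None):
    """Return (ok, message) describing the PHRED_PARAMETER_FILE setting.

    Hypothetical helper for illustration only; phred performs its own
    checks when it starts.
    """
    env = os.environ if environ is None else environ
    path = env.get("PHRED_PARAMETER_FILE")
    if not path:
        return False, "PHRED_PARAMETER_FILE is not set"
    if not os.path.isfile(path):
        return False, "PHRED_PARAMETER_FILE does not name an existing file: " + path
    if not os.access(path, os.R_OK):
        return False, "parameter file is not readable: " + path
    return True, "using parameter file " + path


if __name__ == "__main__":
    ok, message = check_phred_parameter_file()
    print(("ok: " if ok else "problem: ") + message)
```

Passing a dictionary instead of relying on `os.environ` makes the check easy to exercise without changing the real environment.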
Note: if you compile phred on a SUN Solaris OS using the BSD C compiler in the directory `/usr/ucb', you will find that the `-id' command line option fails (phred reports that it cannot read files, and it prints the name of each file it fails to read; however, the name it prints lacks the first few characters of the true name of the file). If this occurs, recompile phred using either the optional C compiler in the directory /opt/SUNWspro/bin or the GNU C compiler.

5. Running phred.

Phred uses command line options to control input, processing, and output. The command line options are delimited by a dash, "-". The command line options are

Input Options
-------------

-id <directory name>
   Read and process files in <directory name>.

-if <file name>
   Read and process files listed in the file <file name>. Each line in <file name> must specify a valid path to a single input file.

-zd <directory name>
   Location of compression program. If -zd is omitted, phred uses the current path to search for the compression program.

-zt <directory name>
   Directory where chromat is uncompressed. If -zt is omitted, phred uses /usr/tmp. When phred processes a compressed file, it uncompresses the chromat into this temporary directory before it reads the file. It subsequently deletes the uncompressed file in the temporary directory.

Processing Options
------------------

-nocall
   Disable phred base calling and set the current sequence to the ABI base calls that are read from the input file. By default, the current sequence is set to the phred base calls. This affects the base trimming and output options.

-trim <enzyme sequence>
   Perform sequence trimming on the current sequence. Bases are trimmed from the start and end of the sequence on the basis of trace quality. In addition, <enzyme sequence> specifies a base sequence that is used to trim bases off the start of the current sequence. You can specify a NULL enzyme sequence using empty double quotes, "".
   See the note below on the effect of using the trim option.

-trim_alt <enzyme sequence>
   Perform sequence trimming on the current sequence. Bases are trimmed from the start and end of the sequence on the basis of trace quality. Specifically, for each base, the phred error probability is subtracted from the default value of 0.05 (or the value set using the `-trim_cutoff' option), and the resulting values are summed to find the maximum scoring subsequence. Furthermore, the subsequence must have a minimum number of bases. In addition, <enzyme sequence> specifies a base sequence that is used to trim bases off the start of the current sequence. You can specify a NULL enzyme sequence using empty double quotes, "".

-trim_cutoff <value>
   Set trimming error probability for the `-trim_alt' option and the trimming points written in the phd files. The default value is 0.05.

-nonorm
   Disable phred trace normalization. This option is not recommended unless the base caller fails due to huge noise peaks extending over a large region at the start of the trace, as is characteristic of some dye terminator reactions.

-nosplit
   Disable compressed peak splitting. By default, phred identifies and splits C and G peaks that may be a merged pair of peaks. Phred searches for compression prone loop/stem sequence motifs and attempts to confirm a compression using characteristics of the trace, primarily the size of the candidate peak.

-nocmpqv
   Force phred to use the four parameter quality values. By default, phred uses five parameter quality values for dye primer data (only) in order to reduce the quality values of merged CC and GG peaks. (Phred uses the four parameter quality values for dye terminator chemistry data automatically. If phred cannot determine the chemistry, it uses the four parameter quality values.)

-ceilqv <ceil_qv>
   Specifies a maximum quality value assigned to bases.
   Bases with quality value parameters that correspond to quality values greater than <ceil_qv> are assigned the value <ceil_qv>.

-beg_pred <trace_point>
   Specifies the trace point at which to begin the peak prediction. This point should be in a region of `good' trace where the peak spacing is even and representative of the peak spacing throughout the trace. In addition the peaks should be large and the noise low in the region, and the value of <trace_point> must not be within 100 points of the trace ends.

Output Options
--------------

-st fasta
   Set the output sequence file format to FASTA. (Default.)

-st xbap
   Set the output sequence file format to XBAP.

-s
   Write sequence output files with the names obtained by appending ".seq" to the names of the input files, and store them in the directory where phred is running.

-s <file name>
   Write a sequence output file with the name <file name>. This option is valid for a single input file only.

-sd <directory name>
   Write sequence output files with the names obtained by appending ".seq" to the names of the input files, and write them in the directory <directory name>.

-sa <file name>
   Write a sequence output file in FASTA format with the name <file name>. The file contains the base calls of all the reads processed in this run of phred.

-qt fasta
   Set the output quality file format to FASTA. Trimmed off base quality values are set to zero. (Default.)

-qt xbap
   Set the output quality file format to XBAP. Trimmed off base quality values are omitted.

-qt mix
   Set the output quality file format to FASTA. Base quality values for all bases are written (including those for trimmed off bases).

-q
   Write quality output files with the names obtained by appending ".qual" to the names of the input files, and store them in the directory where phred is running. This option is valid for FASTA format output files only.

-q <file name>
   Write a quality output file with the name <file name>.
   This option is valid for a single input file and a FASTA format output file only.

-qd <directory name>
   Write quality output files with the names obtained by appending ".qual" to the names of the input files, and store them in the directory <directory name>.

-qa <file name>
   Write a quality output file in FASTA format with the name <file name>. The file contains the quality values of all the reads processed in this run of phred.

-qr <file name>
   Write a histogram of the number of high quality bases per read. This is meaningful when phred processes more than one read.

-c
   Write SCF files with the trace data, the base calls of the current sequences, and the positions of the base calls. The SCF files have the names of the input files (phred will refuse to write the SCF file if you ask it to write the SCF file in the directory in which the input file resides).

-c <file name>
   Write an SCF file with the trace data, the base calls of the current sequence, and the positions of the base calls. The SCF file has the name <file name>. This option is valid for a single input file only.

-cd <directory name>
   Write SCF files with the trace data, the base calls of the current sequences, and the positions of the base calls. The SCF files are written in the directory <directory name> and have the same names as the input files.

-cp <number of bytes>
   Store SCF trace data as 1 or 2 byte values. Defaults to 1 when the maximum trace value is less than 256, or to 2 when the maximum trace value is greater than or equal to 256. This is the trace precision.

-cs
   Always scale traces before writing them to an SCF output file. This ensures that the largest trace value has the largest value that can be stored in the SCF file. When the file trace precision is `1', the maximum value is 255, and when the precision is 2, the maximum value is 65535.
   Without this option, phred does not scale the trace unless (a) the trace was read from an ESD file or (b) the maximum trace value exceeds the value that can be stored in the SCF file at the precision used. Trace scaling ensures the maximum digital resolution for a given storage precision but it will make a uniformly low level trace appear to be a high level.

-p
   Write a PHD file, which is used by the consed editor to display bases. A PHD file contains a set of comments used by consed for maintaining consistency between the chromat file, the .ace file and the PHD file, and it contains base data as triples consisting of the base call, quality, and position. Phred always writes the first version of the PHD file for a read, which has the name <filename>.phd.1. When a read is edited using consed, a new version of the phd is written by consed, for example, the second version has the name <filename>.phd.2. With the -p option, <filename> is the name of the input file.

-p <filename>
   Write a PHD file with the name <filename>.phd.1. This option is valid for processing a single input file.

-pd <directory name>
   Write PHD files in directory <directory name>. The PHD files have the names <filename>.phd.1 where <filename> is the name of the input file.

-d
   Write a data file that is used for detecting polymorphic bases. The file has the name <filename>.poly where <filename> is the name of the input file. The first line of the file consists of the sequence name, the smallest amplitude normalization factor, and the amplitude normalization factors for the A, C, G, and T traces. One line for each called base follows the header line. The information on each line consists of the called base, the position of the called base, the area of the called peak, the relative area of the called peak, the uncalled base, the position of the uncalled base, the area of the uncalled base, the relative area of the uncalled base, and the amplitudes of the four traces at the position of the called base.
-dd <directory name>
   Write polymorphism data files in directory <directory name>. The files have the names <filename>.poly where <filename> is the name of the input file.

-raw <sequence name>
   Write <sequence name> in the header of the sequence output file and the quality output file. By default, the name of the input file is written in the headers of these files. This option is valid for a single input file only.

-log
   Make phred append a log entry describing the processing run in the file "phred.log".

Miscellaneous
-------------

-v <n>
   Verbose operation. You can control the level of verbosity with <n>, which ranges from 1 to 63.

-tags
   Label common output with tags in order to facilitate output parsing.

-h, -help
   Display a command line option summary.

-doc
   Display phred documentation.

-V
   Display phred version.

Examples
--------

If you plan to use phred base calls and base quality information as input to the phrap assembly program and to the consed finishing program, simply follow the documentation supplied with consed and then type:

   phredPhrap

(with no arguments)

If you intend to use consed, you *MUST* use this perl script. Failure to use this script will result in many consed features not working correctly, including consed's autofinish function, user-defined consensus tags, tagging ALU and other repeats, and tagging vector sequence. Use the phredPhrap perl script.

An outline of the important processing steps performed by the script follows.

Let us say you want to call bases from the chromat files in subdirectory "chromat_dir", use phrap to assemble the contigs, and run consed to edit/examine the contigs. In this case you must ask phred to create "phd" output files, which are required by consed. It runs phred with the options

   % phred -id chromat_dir -pd phd_dir

which causes phred to read the chromat files in "chromat_dir" and write the "phd" files to "phd_dir". Next it makes FASTA files from the "phd" files by running the phd2fasta program.
For example,

   % phd2fasta -id phd_dir -os seqs_fasta -oq seqs_fasta.screen.qual

Subsequently it screens out the vector in the sequences in "seqs_fasta" using cross_match:

   % cross_match seqs_fasta vector.seq -minmatch 12 -minscore 20 -screen > screen.out

which generates the screened sequence file "seqs_fasta.screen". It runs phrap to perform the sequence assembly as follows:

   % phrap seqs_fasta.screen -new_ace > phrap.out

Phrap writes the assembled contigs to the file "seqs_fasta.screen.contigs", and creates a .ace file that can be used for importing the assembly to xbap, consed, or ace-mbly for editing.

As another example, again you want to process the chromat files in subdirectory "chromat_dir", but now you want phred to write the base calls to a FASTA file named "seqs_fasta" and the base quality values to "seqs_fasta.qual". In this case you run phred with the options

   % phred -id chromat_dir -sa seqs_fasta -qa seqs_fasta.qual

We recommend that you not use the trim option. Inaccurate bases called near the ends of the traces will not interfere with proper phrap assembly.

Refer to the file "phrap.doc", which is part of the phrap distribution, for information on cross_match and phrap.

Return values
-------------

Phred returns 0 for successful processing and for file read errors. It returns -1 for processing errors and file write errors. Phred continues processing on file read and write errors but halts on serious processing errors.

6. Phred parameter file

Phred reads the `primer ID' information in the chromatogram and it tries to find the same name in the phred parameter file, which is described in the `Building and installing' section above. If it succeeds, the phredpar.dat entry for the `primer ID' identifies the sequencing reaction chemistry (primer or terminator) and the type of dye.

If it cannot find the `primer ID' information in the chromatogram, it reports

   no dye primer ID in chromat yyyy

where yyyy is the chromatogram name.
If it cannot find the `primer ID' name in phredpar.dat (or it cannot find the phredpar.dat file), it reports

   unknown chemistry (xxxx) in chromat yyyy
   add a line of the form
   "xxxx" <chemistry> <dye type> <machine type>
   to the file zzzz
   type `phred -doc' for more information

where xxxx is the `primer ID' and yyyy is the chromatogram name. Add the indicated line to phredpar.dat.

Phred reads the `PHRED_PARAMETER_FILE' environment variable in order to find the phredpar.dat file. If this is not set on your system, phred reports

   warning: 'PHRED_PARAMETER_FILE' environment variable not set: unable to identify chemistry and dye
   type `phred -doc' for more information

If the `PHRED_PARAMETER_FILE' environment variable is set incorrectly, that is, phred cannot find the phredpar.dat file there or the file is not valid, phred reports

   readParamFile: warning: unable to open file zzzz
   warning: processing without phred parameters

where zzzz is the value of the PHRED_PARAMETER_FILE environment variable. It processes the chromatograms but, as explained above, warns for each chromatogram that it could not read the parameter file. In this case, you must set PHRED_PARAMETER_FILE to a valid name.

In these three cases phred processes the chromatogram but it uses the default (ABI) four parameter quality values, does not try to split compression peaks, and reports the chemistry and dye types to phrap as `unknown'.

If you use a `primer ID' for your reactions that is not in phredpar.dat, you can add the `primer ID' name to phredpar.dat. You will need to know the `primer ID' name as it is stored in the chromatograms, the chemistry type (primer or terminator), the dye name, and the type of sequencing machine. Use a text editor to add `primer ID' entries to phredpar.dat. You will find additional information about the form of phredpar.dat entries in phredpar.dat.
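
To check a candidate entry before adding it to phredpar.dat, a line of the form "xxxx" <chemistry> <dye type> <machine type> can be split into its four fields with a few lines of Python. This is only an illustrative sketch: the names `ChemEntry` and `parse_entry` are hypothetical, the sample line is made up for the example rather than copied from a shipped phredpar.dat, and the sketch handles a single entry line, not the file as a whole.

```python
import shlex
from typing import NamedTuple


class ChemEntry(NamedTuple):
    """One phredpar.dat entry, field names chosen for this sketch."""
    primer_id: str
    chemistry: str
    dye: str
    machine: str


def parse_entry(line: str) -> ChemEntry:
    """Parse one line of the form "xxxx" <chemistry> <dye type> <machine type>."""
    # shlex.split honours the double quotes, so a primer ID containing
    # spaces stays in one field; the other fields split on spaces or tabs.
    fields = shlex.split(line)
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    return ChemEntry(*fields)


if __name__ == "__main__":
    # Illustrative entry only; check the real primer ID with `phred -v 63'.
    print(parse_entry('"ET Primer"\tprimer energy-transfer  MolDyn_MegaBACE'))
```

The sketch does not check that the chemistry, dye, and machine values are ones phred recognizes; it only confirms that the line has the expected four-field shape.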
The columns in phredpar.dat have the form

   column   value name
   ------   ----------
   1        primer identification string
   2        chemistry
   3        dye
   4        sequencing machine type

where the column values are separated by spaces or horizontal tabs. The values phred recognizes are

   value name                values
   ----------                ------
   primer id. string         primer name enclosed in double quotes
   chemistry                 primer, terminator
   dye                       rhodamine, d-rhodamine, big-dye, energy-transfer, bodipy
   sequencing machine type   ABI_373_377, MolDyn_MegaBACE, ABI_3700

7. Notes

Sequence Trimming
-----------------

The trimming options are not intended for use in phrap assemblies so when phred produces FASTA sequence output files with those options, the low quality sequence remains in the output file. However, the trim information is stored in the FASTA header so that the user can remove the low quality sequence. In contrast, when you ask phred to create a "xbap" style sequence output file, the low quality sequence is trimmed off and placed in comment fields within the file.

The FASTA header, as written by phred, contains the following fields

   >chromat_name 1323 15 548 ABI

where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.

ESD Files
---------

Phred reads processed MegaBACE ESD files. It cannot read the raw ESD files.

It is important that you identify the dye chemistry correctly when you run the MegaBACE base caller so that phred can assign the right base to each trace. (This is important with ABI data too.)

In order to obtain the best phred quality value accuracy with MegaBACE data, phred must use the quality value lookup tables designed for this data.
Phred identifies the sequencing machine by reading the `primer ID' string in the chromatogram and matching it with an entry in the phredpar.dat file. The matching entry lists the chemistry, dye, and sequencing machine types. For example, the `primer ID' string of the form `ET Primer' identifies a chromatogram as ET dye primer data generated on a MegaBACE sequencing machine. You can check that phred interprets the `primer ID' string correctly by using the `-v 63' option to have phred write diagnostic information to the screen.

8. References

Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. 1998. Genome Research 8:175-185.

Brent Ewing and Phil Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. 1998. Genome Research 8:186-194.
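
As a closing illustration, the relation QV = -10 * log_10( P_e ) from the Algorithms section and the scoring rule described for `-trim_alt' (each base scores the cutoff minus its error probability, and the maximum scoring subsequence is kept) can be sketched in Python. This is a sketch under stated assumptions, not phred's implementation. The function names are hypothetical. The maximum scoring run is found here with a single Kadane-style pass, which is one standard way to do it and may differ from what phred does. The `min_len` parameter stands in for the minimum number of bases, which this document mentions but does not specify.

```python
import math


def error_probability(qv):
    """Invert QV = -10 * log10(P_e) to get the error probability."""
    return 10.0 ** (-qv / 10.0)


def quality_value(p_error):
    """Apply QV = -10 * log10(P_e)."""
    return -10.0 * math.log10(p_error)


def trim_alt_window(qvs, cutoff=0.05, min_len=1):
    """Return (start, end) of the maximum scoring run, end exclusive, or None.

    Each base scores cutoff - P_e. min_len is a hypothetical stand-in for
    the unspecified minimum length; it is applied only to the best run.
    """
    best_sum, best = 0.0, None
    run_sum, run_start = 0.0, 0
    for i, qv in enumerate(qvs):
        run_sum += cutoff - error_probability(qv)
        if run_sum <= 0.0:
            # A non-positive running total cannot help a later run,
            # so start a fresh run after this base.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    if best is None or best[1] - best[0] < min_len:
        return None
    return best


if __name__ == "__main__":
    qvs = [5, 8, 30, 40, 35, 30, 6, 4]
    print(trim_alt_window(qvs))
```

With the default cutoff of 0.05, a base contributes a positive score only when its error probability is below 0.05, which corresponds to a quality value a little above 13.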