/******************************************************************************
 *
 * Program:   Refcomp
 * Version:   1.0
 * Copyright (C) 1996-1997
 * by Deborah A. Nickerson, Natali Kolker, Scott Taylor, and Mark Rieder
 * University of Washington
 *
 * All rights reserved.
 *
 * This software may not be redistributed, distributed
 * in modified form, or used for any commercial purpose, including
 * commercially funded sequencing, without written permission from
 * the authors and the University of Washington.
 *
 * This software is provided ``AS IS'' and any express or implied
 * warranties, including, but not limited to, the implied warranties of
 * merchantability and fitness for a particular purpose are disclaimed.
 * In particular, this disclaimer applies to any diagnostic purpose. In no
 * event shall the authors or the University of Washington be liable for
 * any direct, indirect, incidental, special, exemplary, or consequential
 * damages (including, but not limited to, procurement of substitute goods
 * or services; loss of use, data, or profits; or business interruption)
 * however caused and on any theory of liability, whether in contract,
 * strict liability, or tort (including negligence or otherwise) arising
 * in any way out of the use of this software, even if advised of the
 * possibility of such damage.
 *
 ******************************************************************************/

DESCRIPTION
-----------

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA
sequence variation in the human genome. The identification and typing of
these variations plays a central role in analyzing the relationships between
genome structure and function, and in understanding the allelic variation
within and among populations.
In addition, the typing of SNPs plays a role in identifying mutated
oncogenes, genetic and infectious diseases, in matching tissues prior to
transplantation, and in forensic and paternity testing.

Many techniques are used to identify sequence variants among different
individuals using DNA amplified by the polymerase chain reaction (PCR).
These include denaturing gel electrophoresis, chemical or enzymatic
cleavage, heteroduplex analysis, the analysis of single-stranded DNA
conformations, and direct sequencing of the PCR product. Of these methods,
direct DNA sequencing is the most accurate and the most automated approach
for scanning DNA fragments for variation. Furthermore, it is the only method
that provides complete information about the location and nature of the
sequence variation using a single set of reagents and assay conditions.

Despite the advantages of sequencing PCR products to identify DNA
variations, there is one drawback: it is difficult to accurately identify
heterozygous sites within a sequence. By comparing sequence traces
containing variant sites for homozygotes and heterozygotes we have noted two
consistent changes: 1) a significant drop in fluorescence peak height at a
variant site when sequence traces obtained from homozygous individuals are
compared to traces from heterozygous individuals, and 2) the presence of a
second fluorescence peak in sequence traces from heterozygous individuals
(1, 2). We have developed a program known as PolyPhred to scan for these two
features when sequence traces are being compared (2).

Refcomp (6) was designed to analyze sequencing traces that contain data
from strictly homozygous samples (e.g., cloned DNA, mitochondrial DNA, etc.).
This data represents a special case which can be analyzed for mismatches
with a known reference sequence. Refcomp will determine the high quality
positions within an assembled DNA contig and produce a report listing these
sites.

HOW REFCOMP WORKS
-----------------

Refcomp is designed as a member of an integrated suite of sequence analysis
applications which includes Phred (3), Phrap (4), and Consed (5), and is not
a stand-alone program. Phred provides the base-calls, base-call quality
information and the peak size information. Phrap is used to assemble the
input sequences into one or more contigs. Refcomp will parse contig
alignment data files (.ace files), produced by the program Phrap, and
identify sites in the consensus sequence which differ from a defined
reference sequence. Refcomp will generate data tags for the mismatched
sites, and the Consed program is used to visually review the Refcomp output.
The Refcomp output is also written to a file in a format that can be easily
parsed into a database program.

COMMAND LINE
------------

1) -ace [name of ace file]
   The user must supply the name of the ace file containing the assembly
   information for the assembly to be scanned. Required.

2) -d [directory name]
   The user can supply the name of the directory that contains edit_dir
   (see EXAMPLES). Optional. Default "../"

3) -quality [number between 0 and 50]
   The Phred quality value used to determine search limits and to limit the
   base mismatches reported in the output. Optional. Default 20.

OUTPUT
------

Refcomp writes output both to the standard output port and the standard
error port. The standard output contains the results of the analysis, such
as the location of each mismatched site and the reads that cover it. The
standard error contains messages about Refcomp's progress. Generally the
standard output is redirected to a file and the standard error is allowed to
be printed on the screen. In both outputs, information is printed between
beginning and ending tokens to make parsing of the file easy.

STANDARD OUTPUT
---------------

1) Prints out the command line used for the run. If optional parameters
   were not supplied on the command line then the defaults are printed.
   The associated token is 'COMMAND_LINE'.

2) For each contig, prints the contig name. The associated token is
   'CONTIG'.

3) For each mismatched site, Refcomp prints the position in the designated
   reference sequence (R. pos.), the position in the consensus sequence
   (C. pos.), the base found in the reference sequence (R. base), the base
   found in the consensus sequence (C. base), the consensus sequence Phrap
   quality (C. Quality), a flag for opposite strand confirmation
   (1 = confirmation, 0 = no confirmation), and a flag designating whether
   a forward or reverse read was found at that position (1 = read present,
   0 = no read present). The associated token is 'SITES'.

4) For each mismatched position, Refcomp prints the name of the sequence
   chromatogram covering that site. The information is given in the
   following order: reference sequence position (R. pos.), consensus
   sequence position (C. pos.), chromatogram name, base found in the read
   (Base), read Phred quality associated with the base at this position
   (Q. phred), and Phrap quality at this position in the consensus sequence.
   The associated token is 'READS'.

5) Finally, Refcomp reports the total number of sites found that differ
   from the reference sequence. The associated token is 'TOTAL_SITES'.

STANDARD ERROR
--------------

1) First prints out header information which includes the version number, a
   unique integer for identifying the Refcomp output (the same integer is
   used to uniquely identify the standard error), and the time of the
   execution. The token used to separate these lines from the remainder of
   the output is 'HEADER'.

2) Next prints out the command line used for the run. If optional
   parameters were not supplied on the command line then the defaults are
   printed. The associated token is 'COMMAND_LINE'.

3) Then, for each contig, prints various messages which report on the
   progress of the run. The associated token is 'MESSAGE'.
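
Because both outputs are printed between beginning and ending tokens, a
redirected output file can be split into blocks with a few lines of code.
The Python sketch below is illustrative only and is not part of Refcomp.
This README does not give the exact spelling of the markers, so the sketch
ASSUMES lines of the form BEGIN_<TOKEN> and END_<TOKEN> (for example
BEGIN_SITES ... END_SITES). The function name split_token_blocks and the
sample text are hypothetical; check a real refcomp_output file and adjust
the two patterns before relying on it.

```python
import re
from collections import defaultdict

# ASSUMPTION: blocks are delimited by lines "BEGIN_<TOKEN>" and "END_<TOKEN>".
# The README only states that information is printed between beginning and
# ending tokens; adjust these patterns to match a real output file.
BEGIN_RE = re.compile(r"^BEGIN_(\w+)\s*$")
END_RE = re.compile(r"^END_(\w+)\s*$")


def split_token_blocks(text):
    """Group the lines found between matching begin/end markers by token.

    Returns a dict mapping token name -> list of blocks, where each block
    is the list of raw lines between one begin marker and its end marker.
    """
    blocks = defaultdict(list)
    current_token = None
    current_lines = []
    for line in text.splitlines():
        begin = BEGIN_RE.match(line)
        end = END_RE.match(line)
        if current_token is None:
            # Outside any block: only a begin marker is of interest.
            if begin:
                current_token = begin.group(1)
                current_lines = []
        elif end and end.group(1) == current_token:
            # Matching end marker closes the current block.
            blocks[current_token].append(current_lines)
            current_token = None
        else:
            # Any other line inside a block is kept verbatim.
            current_lines.append(line)
    return dict(blocks)


# Hypothetical sample text, made up for illustration; NOT real Refcomp output.
SAMPLE = """\
BEGIN_COMMAND_LINE
refcomp -ace gene.fasta.ace -d ../ -quality 20
END_COMMAND_LINE
BEGIN_TOTAL_SITES
3
END_TOTAL_SITES
"""

if __name__ == "__main__":
    for token, token_blocks in split_token_blocks(SAMPLE).items():
        print(token, token_blocks)
```

Keeping each block as a list of raw lines leaves the field layout of the
'SITES' and 'READS' blocks untouched, so the column handling can be written
separately once the actual output has been inspected.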

USING YOUR OWN DATA
-------------------

These instructions will allow you to take chromatograms, analyze them with
Phred and Phrap, and run Refcomp on the .ace file.

1) Install Refcomp (see README and INSTALL).

2) Create the following directory structure in your working directory:

      ./chromat_dir
      ./edit_dir
      ./phd_dir

3) Move your chromats into 'chromat_dir'.

4) Put your reference sequence .phd file into 'phd_dir'. For more
   information on creating a "reference sequence" see the notes below.

5) Change directory into 'edit_dir'.

6) Type 'phredPhrapRef'. (Alternatively, you can run Refcomp by itself on
   an existing .ace file; see EXAMPLES.)

7) Run consed.

EXAMPLES
--------

/home/gene ---|--- edit_dir ---|--- gene.fasta.ace
              |--- chromat_dir
              |--- phd_dir

1) You can run Refcomp from any directory by typing:

      refcomp -d /home/gene -ace gene.fasta.ace > refcomp_output

   or you can change into edit_dir first and rely on the default
   directory "../":

      cd /home/gene/edit_dir/
      refcomp -ace gene.fasta.ace > refcomp_output

2) If you would like to set the quality threshold explicitly, use the
   -quality option. The value 20 shown here is the default; substitute
   your own threshold:

      refcomp -ace gene.fasta.ace -quality 20 > refcomp_output

NOTES ON ANALYSIS
-----------------

The program Refcomp relies on the comparison of the consensus sequence
generated by Phrap and a "reference sequence" which is assembled into the
same contig. In order to create a reference sequence you need the sequence
you are interested in. This sequence can then be formatted to look like a
.phd file created by the program Phred. The reference sequence file is
placed in the phd_dir; after running Phrap it will be included in the
assembly and can be viewed in Consed.

** Note: In order to be recognized as a reference sequence, the name of the
file must contain the character string "ref" (in upper, lower, or mixed
case).

Examples of other reference sequences for the mitochondrial genome can be
found at http://droog.mbt.washington.edu

Refcomp tags all mismatched sites with the specific "polymorphism" tag found
in Consed.
This enables users to navigate between sites easily and rapidly and allows
for easy verification by the data analyst.

Refcomp has a default setting of quality 20 for tagging mismatched sites.
This quality setting has been used in previous studies for finding
homoplasmic polymorphisms in the mitochondrial genome with single pass
sequence coverage (6).

REFERENCES
----------

(1) Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A.,
    1994, "Comparative analysis of human DNA variations by
    fluorescence-based sequencing of PCR products", Genomics, 25, 615-622.

(2) Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "PolyPhred:
    automating the detection and genotyping of single nucleotide
    substitutions using fluorescence-based resequencing", Nucleic Acids
    Res. 25(14): 2745-2751.

(3) Ewing, B. and Green, P., 1992, Phred, unpublished.
    http://www.genome.washington.edu/

(4) Green, P., 1994, Phrap, unpublished.
    http://www.genome.washington.edu/

(5) Gordon, D., 1995, Consed, unpublished.
    http://www.genome.washington.edu

(6) Rieder, M.J., Taylor, S.L., Tobe, V.O., and Nickerson, D.A., 1998,
    "Automating the identification of DNA variations using quality-based
    fluorescence re-sequencing: analysis of the human mitochondrial
    genome", Nucleic Acids Res. 26(4): 967-973.