PROGRAMS: Programs located in /usr/local/genome/bin which are used in SNP detection and gene analysis. This is not a comprehensive list, but contains those programs most often used in analysis.



2FOF.PL: THIS PROGRAM IS USED TO CREATE A FOF FILE THAT IS THEN USED BY CONSED TO ADD NEW READS. IT TAKES AN ACE FILE AS INPUT AND THE OUTPUT FILE IS SPECIFIED BY THE USER - TYPICALLY [DATE].FOF

COMMAND LINE PARAMETERS:

·        -ace [ace filename]

·        -out [filename.fof]

 

2HAP: A WRAPPER FOR PHYLIP (PHYLOGENY INFERENCE PACKAGE) PROGRAMS, WHICH INFER HAPLOTYPES. 2HAP USES AN ALGORITHM TO "SUBTRACT" OUT THE COMMON PATTERN TO "DEDUCE" THE UNCOMMON PATTERN. THIS PROGRAM CREATES MANY OUTPUT FILES.

 

CHANGESEQEVENTID.PL: OCCASIONALLY SAMPLE NAMES WILL BE INCORRECT DUE TO SAMPLE HANDLING ERRORS, AND MOST OF THE TIME THE ERROR IS NOT CAUGHT UNTIL AFTER THE ANALYST ATTEMPTS TO ADD THE READS TO THE ASSEMBLY. IN THIS CASE, THE MOVECHROMATS.PL PROGRAM HAS ALREADY MODIFIED THE NAME AND POPULATED THE DATABASE. ANALYSTS MUST CORRECT THE CHROMAT NAMES AND THEIR ASSOCIATION TO A SEQUENCING EVENT, WHICH INVOLVES RENAMING THE CHROMATS TO THE CORRECT CHROMAT NAME WITH THE CORRECT GENE NAME, PRIMER NAME, SAMPLE NAME, DIRECTION, CHEMISTRY, ETC. THIS PROGRAM IS USED TO CHANGE THE SEQ_EVENT_ID OF CHROMATS THAT HAVE BEEN MISNAMED TO THE NEW SEQ_EVENT_ID OBTAINED FROM THE "DUMMY" SAMPLE SHEET.

COMMAND LINE PARAMETERS:

·        -chromat - the directory path name where the mishandled chromats are located

·        -new_id - the file name of the "dummy" sample sheet

 

COMPARE_GENOTYPE.PL: COMPARES TWO PRETTYBASE FILES CREATED FROM TWO DIFFERENT POLYPHRED OUTPUT FILES, USING DIFFERENT QUALITY PARAMETERS DURING THE POLYPHRED RUNS.

COMMAND LINE PARAMETERS:

·        -lq [lower quality polyphred output file]

·        -hq [higher quality polyphred output file]

 

COMPLEMENT.PL: REVERSES AND COMPLEMENTS THE GIVEN INPUT FASTA SEQUENCE. THIS PROGRAM WILL STRIP THE HEADER FROM THE ORIGINAL FASTA FILE, SO YOU MUST ADD A NEW HEADER TO THE REVERSE COMPLEMENTED OUTPUT FILE. THIS PROGRAM CAN ALSO BE USED WITHOUT SPECIFYING AN OUTPUT FILE TO SIMPLY GET THE COMPLEMENT OF ANY SEGMENT FOLLOWING THE HEADER IN THE INPUT FILE.

COMMAND LINE EXAMPLES:

·        complement.pl -file [fasta filename] -header [text string that should follow] > [outputfile.fasta] OR

·        complement.pl -file [fasta filename]

 

CONSED: GRAPHICAL USER INTERFACE CREATED BY DAVID GORDON - CONTAINS MANY FUNCTIONS. SEE CONSED DOCUMENTATION FOR MORE DETAILS.

 

CONSED_EDIT: RUNS CONSED, BUT ALSO COPIES LOCAL CHANGES INTO GLOBAL DIRECTORY.

 

CREATEHAPOUTPUT.PL: THIS PROGRAM IS DESIGNED TO CREATE 'PRETTYBASE' FORMATTED FILES USING PHASE OUTPUT FILES; THE OUTPUT IS SORTED BY FREQUENCY AND SAMPLE FOR THE INDIVIDUAL HAPLOTYPE FILES AND BY SAMPLE FOR THE COMBINED POPULATION FILES.

COMMAND LINE EXAMPLES:

·        createHapOutput.pl -lab_symbol [lab_symbol]



CROSS_MATCH: A PROGRAM THAT SEARCHES FOR PATTERNS BETWEEN TWO FILES. IF A RUN DOES NOT PRODUCE ANY RESULTS, THE MINMATCH AND MINSCORE VALUES CAN BE ADJUSTED; LOWERING THEM FROM THE DEFAULTS TO TEN WILL TYPICALLY PRODUCE MORE RESULTS. FOR FEWER MATCHES, RAISE THE MINMATCH OR MINSCORE.

COMMAND LINE EXAMPLES:

·        cross_match -alignments [filename.1.fasta] [filename.2.fasta] > [output.xm]

·        cross_match -alignments -minmatch [10] -minscore [10] [filename.1.fasta] [filename.2.fasta] > [output.xm]

 

DBSNP.PL: THIS PROGRAM FORMATS POLYMORPHIC SITES FOR DBSNP SUBMISSION. IT CAN ALSO BE USED TO HELP SELECT OLA PRIMERS.

COMMAND LINE PARAMETERS:

·        -reference [reference file]

·        -frequency [SNP freq. file]

·        -gene [gene abbreviation]

·        -accession [accession reference number]

·        -window [nucleotides on either side of site]

 

DBSNP_SUBMIT.PL: THIS PROGRAM FORMATS POLYMORPHIC SITES FOR DBSNP SUBMISSION. IT SHOULD BE USED FOR DATA SUBMISSION FOR BOTH THE EGP AND PGA PROJECTS.

COMMAND LINE PARAMETERS:

·        -lab [lab symbol]

·        -hugo [hugo symbol]

·        -handle [EGP_SNPS OR PGA-UW-FHCRC]

·        -accession [genbank accession reference number]

·        -population [AD, ED, PDR90]

·        -locusnumb [locuslink identification number]

·        -project [egp OR pga]

 

DRAWMAP: A PROGRAM THAT "DRAWS" A FEATURE MAP WHICH CAN INCLUDE REPEATS, EXONS, PRODUCTS (PRIMERS) AND VARIATIONS. TYPES OF FILES INCLUDE CROSS_MATCH AND TABLE FILES. FOR FURTHER INFORMATION OR EXAMPLES, SEE ONE OF THE ANALYSTS.

OPTIONS: -help provides more information

·        -beg [start of graph (bp)] - this is a value where the graph should begin (typically 0)

·        -end [end of graph (bp)] - this is a value where the graph should end

·        -scale [bp per centimeter] - this parameter may need to be played with to find the desired result

·        -list [list filename] - a filename containing the desired information can be specified (examples of such a file are available). the list file contains a table with each line containing file (path), filetype, graphtype, and file object fill (color, 1, or 0) - the elements must be separated by spaces

·        -outfile [filename.ps] - specify the name of the output file

·        -gt [graph type] - example: 3d_histogram

·        -ft [file type] - example: cross_match

·        -fill [type of fill] - this can be black and white or color

 

EDITFASTA: THIS PROGRAM TAKES THE SEQUENCE FILE IN FASTA FORMAT, THE BEGINNING AND END POSITIONS AND WRITES A NEW FASTA SEQUENCE TO THE STANDARD OUTPUT (REDIRECT THE OUTPUT TO A FILE IF NECESSARY).

COMMAND LINE PARAMETERS:

·        -fasta [sequence file in fasta format]

·        -beg [start position]

·        -end [end position]
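
The extraction that editfasta performs can be sketched in Python as follows. This is an illustration only, not the editfasta source: the function name extract_region is hypothetical, and the sketch assumes that -beg and -end are 1-based and inclusive, which the description above does not state.

```python
def extract_region(fasta_text, beg, end, width=60):
    """Return a FASTA record holding bases beg..end (assumed 1-based, inclusive)."""
    lines = fasta_text.strip().splitlines()
    # Reuse the original header if one is present.
    header = lines[0] if lines and lines[0].startswith(">") else ">sequence"
    seq = "".join(l.strip() for l in lines if not l.startswith(">"))
    sub = seq[beg - 1:end]
    # Wrap the extracted bases at a fixed width (60 is an assumption).
    rows = [sub[i:i + width] for i in range(0, len(sub), width)]
    return "\n".join([f"{header} {beg}-{end}"] + rows)


print(extract_region(">demo\nACGTACGTAC\nGGGTTT\n", 3, 12))
```

As with editfasta itself, the result goes to standard output and can be redirected to a file.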

 

FASTA2GIBCO.PL: THIS PROGRAM TAKES A FASTA FILE AND FORMATS FOR THE GIBCO PRIMER ORDERING SYSTEM ON THE WEB - THE OUTPUT SHOULD BE REDIRECTED TO AN OUTPUT FILE.

 

FASTA2OPERON.EMAIL.PL: THIS PROGRAM CREATES AN OUTPUT FILE THAT CAN BE EMAILED TO OPERON FOR PRIMER ORDERS. OUTPUT FILE CAN BE EMAILED TO dna@operon.com AS AN ATTACHMENT.

COMMAND LINE PARAMETERS:

·        -file [fasta file listing the primers to be ordered]

·        -format [tube or plate]

·        -po_numb [po number for order, currently 439625 for pga]

 

FASTA2PHD.MANYCONTIGS.PL: THIS PROGRAM TAKES A MULTIPLE ENTRY FASTA FILE AND CREATES A PHD FILE FOR EACH ENTRY - OUTPUT SHOULD BE REDIRECTED TO AN OUTPUT FILE.

COMMAND LINE EXAMPLE:

·        ref_fasta2phd.pl [fasta reference filename] > [base_name phd output file]

 

FASTA_SPLIT.PL: THIS PROGRAM TAKES A FASTA FILE FROM THE DRAFT GENOME SEQUENCE WITH CHARACTER DIVIDERS BETWEEN CONTIGS AND CREATES A FASTA FILE WITH A SINGLE ENTRY FOR EACH CONTIG. THE OUTPUT SHOULD BE REDIRECTED TO AN OUTPUT FILE.

COMMAND LINE EXAMPLE:

·        fasta_split.pl [fasta reference filename] > [output filename]

 

FINALIZE_WEB_DATA.PL: THIS PROGRAM IS DESIGNED TO FINALIZE THE DATA IN STATS_DIR FOR POSTING ON THE WEB. IT FORMATS THE GENERAL OUTPUT FILES AND MUST BE RUN IN THE <PROJECT>/DATA/<GENE> DIRECTORY FOR EACH COMPLETED GENE.

COMMAND LINE PARAMETERS:

·        -lab [lab symbol]

·        -hugo [hugo symbol]

·        -project [egp or pga]

·        -genbank [genbank accession number] re-run program with this flag AFTER data is posted as one of the files is a link to the actual genbank posting.

 

FINISH_GENE.PL: THIS PROGRAM IS USED TO FINISH A GENE AND FORMAT THE OUTPUT FILES. IT USES TRANSLATE_NEW.PL FOR PROTEIN AND CSNP TRANSLATION.

COMMAND LINE PARAMETERS:

·        -lab [lab_symbol]

·        -hugo [hugo symbol]

·        -drive [drive for data, example: c16]

·        -startpos [starting position, only necessary to force the program to choose a specific start]

 

GA.PL: THIS PROGRAM IS A MODIFIED VERSION OF GENE_ANALYSIS AND IS USED TO QUERY THE ORACLE DATABASE, CREATING VARIOUS OUTPUT FILES INCLUDING ALLELES.TXT, PRETTYBASE.TXT, FREQ, ETC.  THE DATABASE MUST HAVE BEEN POPULATED USING PANALYSISTABLES.PL.

COMMAND LINE PARAMETERS: all required except -sample

·        -polyphred OR -poly_exp_id [either the polyphred output file OR the polyphred experiment id]

·        -database [egp OR pga]

·        -output [prefix to attach to output files - typically <gene><date> OR <gene>_<date>]

·        -lab_symbol

·        -population [/c16/pga.samples OR /ff1/EGP/egp_samples]

·        -hugo [hugo symbol]

·        -pwd [password for accessing the database]

·        -sample [OPTIONAL] - specifications for parsing the file, the default is the method used by the pga, but if chromat names vary with other projects, this value will have to be supplied by the user.

 

GC_CALC.PL: THIS PROGRAM CALCULATES THE GC CONTENT (%) IN A FASTA FILE OR OVER A SPECIFIC BASEPAIR RANGE. THE OUTPUT SHOULD BE REDIRECTED TO A FILE.

COMMAND LINE EXAMPLE:

·        gc_calc.pl [fasta input file] > [output filename]
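
The GC calculation can be sketched in Python as follows. This is a minimal illustration, not the gc_calc.pl source: the function name gc_percent is hypothetical, the range is assumed to be 1-based and inclusive, and bases other than A, C, G and T are assumed to be left out of the total.

```python
def gc_percent(seq, beg=None, end=None):
    """Percent G+C in seq, optionally over bases beg..end (assumed 1-based, inclusive)."""
    seq = seq.upper()
    if beg is not None or end is not None:
        seq = seq[(beg or 1) - 1:end]
    # Assumption: N and other ambiguity codes are not counted in the total.
    counted = [b for b in seq if b in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for b in counted if b in "GC")
    return 100.0 * gc / len(counted)


print(gc_percent("ACGTGGCC"))
```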

 

GENE_ANALYSIS: THIS PROGRAM COMBINES POLYPHRED2DB, SQ4POLYPHRED AND SITE_STATS, USING THE SAME INPUT AS POLYPHRED2DB.

COMMAND LINE PARAMETERS: -polyphred and -output are required

·        -polyphred [name of polyphred output file] - remember to rerun polyphred to get the latest output information

·        -output [prefix] - this is appended to all created files - this parameter must begin with a letter and not a number

·        -database [name of postgres database to use]

·        -table [table name] - usually the same as the output prefix - this parameter must begin with a letter and not a number

·        -sequence [sequence] in fasta format to map sites to - typically this is the genbank reference sequence

·        -population [file containing specific samples listed down in one column]

·        -query [specify postgres query options]

·        -sample - regular expression used to separate sample name from file name - example '(\w{5})(\w{3})' - this will vary depending on the format of the filenames - the example {5} is for a two letter gene abbreviation and a three digit (alpha-numeric) sample name.
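
How a -sample regular expression splits a chromat file name can be tried out in Python before running the analysis. The sketch below uses the two patterns quoted in this document; the file names are made-up examples, and which captured group holds the sample name is not stated here, so the function simply returns both groups.

```python
import re

# Patterns quoted in this document for the -sample flag.
SHORT_RE = re.compile(r"(\w{5})(\w{3})")
PGA_RE = re.compile(r"(\w{9})(\w{4})")


def split_chromat_name(name, pattern=SHORT_RE):
    """Return the captured groups from the start of a chromat file name, or None."""
    m = pattern.match(name)
    return m.groups() if m else None


# Hypothetical file names, for illustration only.
print(split_chromat_name("AB001D12.f1"))
print(split_chromat_name("ABCDE0001D001.f1", PGA_RE))
```

If the function returns None, the pattern does not fit the file name format and a different -sample value is needed.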

 

INDEL_SITE_MODIFICATION.PL: THIS PROGRAM IS DESIGNED TO CHANGE COMMENT TAGS ASSOCIATED WITH DIALLELIC INDELS AND MUST BE RUN IN THE EDIT_DIR. THE COMMENT TAGS CONTAIN THE CONSENSUS SITE VALUE, WHICH MAY CHANGE AFTER RE-PHREDPHRAPING THE GENE OR AFTER MODIFYING THE CONSENSUS SEQUENCE. THE PRODUCT IN WHICH THE INDEL IS LOCATED MUST BE SPECIFIED. THE -PRODUCT2 AND -PRODUCT3 FLAGS CAN BE USED IF THE INDEL SPANS MULTIPLE PRODUCTS.

OPTIONS: all are required except -product2 and -product3

·        -gene [lab gene name]

·        -product [product in which indel is located] - this value must include any preceding 0s (i.e. 001) [-product2 and -product3 follow the same pattern]

·        -old_site [incorrect site listed in indel tags]

·        -new_site [correct consensus position]

LDE++: VISUAL REPRESENTATION OF THE .LDE.TXT FILE WHICH IS CREATED WHEN SQ4POLYPHRED IS RUN. RED SITES ARE SITES THAT ARE IN LINKAGE DISEQUILIBRIUM.

 

MAKE_FINALREF.PL: CREATES A FINAL REFERENCE SEQUENCE FROM THE CONSENSUS SEQUENCE OF THE FINAL ASSEMBLY. THE NEW FINAL REFERENCE SEQUENCE WILL BE UPDATED WITH ALLELE FILES.

OPTIONS: all are required except -navigator1 and -navigator3

·        -navigator [gene.list created by pga_gene_analysis]

·        -contig [gene.contigs]

·        -allele [gene.alleles.out]

·        -hugo [hugo symbol for gene]

·        -navigator1 [gene.diallelic.list]

·        -navigator3 [gene.manualPoly.list]

 

MAKE_GENBANKTABLE.PL: THIS PROGRAM CREATES A FEATURE TABLE FOR A GENBANK FILE.

COMMAND LINE PARAMETERS:

·        -xm_repeat [gene.repeats.xm]

·        -frequency [gene.freq.txt]

·        -hugo [hugo symbol]

 

MAKEPRETTYBASEFILE.PL: THIS PROGRAM USES THE LAB_SYMBOL.DATABASE.TXT FILE (CREATED BY QUERYGENOTYPE.PL) TO CREATE A PRETTYBASE FILE. THE OUTPUT SHOULD BE REDIRECTED TO AN OUTPUT FILE NAMED LAB_SYMBOL.PRETTYBASE.TXT.

COMMAND LINE PARAMETERS:

·        -db [lab_symbol.database.txt]

·        -population - file containing the sample names, one sample per line [/c16/pga.samples for PGA]

 

MKTRACE: CREATES A .PHD FILE AND A CHROMAT FILE FROM A GIVEN SEQUENCE - THIS IS TYPICALLY USED WITH REFERENCE SEQUENCES OBTAINED FROM GENBANK BECAUSE THE .PHD FILE AND CHROMAT FILE DO NOT EXIST.

 

MOVECHECKEDCHROMATS.PL: THIS PROGRAM MOVES CHROMATS FROM THE CHROMAT_DIR TO EITHER BAD_CHROMATS OR SAVED_CHROMATS AND REMOVES THE ASSOCIATED PHD AND POLY FILES DEPENDING ON THE STATUS FLAG. FOR EXAMPLE, IF CHROMATS COULD NOT BE ADDED USING ADDNEWREADS, THEY WOULD NEED TO BE UPDATED TO STATUS 'BAD', WHEREAS IF A PRIMER WAS FOUND TO BE ALLELE SPECIFIC, THE STATUS OF THE CORRESPONDING CHROMATS WOULD NEED TO BE UPDATED TO 'SAVED'. IT CAN EITHER TAKE A FILE CONTAINING CHROMAT NAMES AS INPUT, USEFUL IN THE CASE OF CHROMATS THAT COULD NOT BE ADDED THROUGH ADDNEWREADS (JUST SAVE THE CONSED OUTPUT LIST) OR IT CAN CREATE A LIST USING THE -GENE AND -CHROMAT FLAG WHERE -CHROMAT IS ACTUALLY THE PRODUCT OR PRIMER THAT NEEDS TO BE REMOVED.

COMMAND LINE PARAMETERS: either -chromat_list OR -product must be specified.

·        -status [bad or saved]

·        -chromat_list [file containing list of chromats to move or product/primer to move] - this file should contain a list of chromats to be modified

·        -product [product or primer to be removed, including gene name] - example: SMPT10221 or SMPT1022

 

MOVECHROMATS.PL: MOVE NEW CHROMATS FROM THE NEW_CHROMATS DIRECTORY TO THE APPROPRIATE GLOBAL OR CHIMP CHROMAT_DIR UNDER GENE DIRECTORIES.

COMMAND LINE PARAMETERS:

·        -database [egp OR pga]

·        -chimp_list [/c16/chimp_list]

 

NICE_PRINT_FASTA.PL: FORMATS A FASTA FILE TO BE PRINTED IN 60bp ROWS WITH SPACES EVERY 10bp, WHICH CORRESPONDS TO THE GENBANK FORMAT. REDIRECT THE OUTPUT TO A FILE.

COMMAND LINE EXAMPLE:

·        nice_print_fasta.pl [fasta filename]
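
The row layout described above (60 bases per row, a space after every 10) can be sketched in Python as follows. This is an illustration only, not the nice_print_fasta.pl source; the function name nice_rows is hypothetical, and the sketch does not add the position numbers that a full GenBank record carries.

```python
def nice_rows(seq, row=60, block=10):
    """Split a sequence into rows of `row` bases with a space every `block` bases."""
    seq = "".join(seq.split())  # drop any existing whitespace or line breaks
    out = []
    for i in range(0, len(seq), row):
        chunk = seq[i:i + row]
        out.append(" ".join(chunk[j:j + block] for j in range(0, len(chunk), block)))
    return out


for line in nice_rows("ACGT" * 20):
    print(line)
```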

 

PANALYSISTABLES.PL: THIS PROGRAM IS DESIGNED TO POPULATE THE FOLLOWING TABLES IN THE ORACLE DATABASE: POLYPHRED_EXP, CONTIG, CONTIG_CONSENSUS_SEQ, CONTIG_CHROMAT, POLYSITE_IN_POLYPHRED_EXP, POLYMORPHIC_SITE, PIPE_GENO, CHROMAT_GENOTYPE.

COMMAND LINE PARAMETERS:

·        -polyphred [polyphred output file]

·        -lab_symbol [five character gene code used by Nickerson lab]

·        -exp_status [initial, cleanup or final]

 

PB2DB: PRETTYBASE TO DATABASE.

COMMAND LINE PARAMETERS:

·        -database [name of postgres database to use]

·        -pb [prettybase filename]

·        -table [new table name] - this specifies the name of the new table that will be created

 

PCR_OVERLAP.PL: THIS PROGRAM CHOOSES SETS OF PRIMERS TO MAKE OVERLAPPING PCR PRODUCTS.

COMMAND LINE PARAMETERS: -file and -gene are required

·        -file [fasta filename]

·        -gene [gene name abbreviation]

·        -start [start position] - default = 150

·        -size [average product size] - default = 1000

·        -overlap [average overlap between products] - default = 180

·        -nouniv - switch; if present, the universal sequence is not added to the primers

·        -setnumb [primer set number] - default = 1
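
How the -start, -size and -overlap defaults interact can be sketched in Python. The function below only lays out target product windows along a sequence; it does not pick primers, and it is not the pcr_overlap.pl algorithm. The function name product_windows and the rule that each window begins `overlap` bases before the previous one ends are assumptions made for illustration.

```python
def product_windows(seq_len, start=150, size=1000, overlap=180):
    """Return (begin, end) windows, 1-based inclusive, tiled with a fixed overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    windows = []
    pos = start
    while pos <= seq_len:
        end = min(pos + size - 1, seq_len)  # clip the last window to the sequence
        windows.append((pos, end))
        if end == seq_len:
            break
        pos = end - overlap + 1  # next window shares `overlap` bases with this one
    return windows


print(product_windows(2500))
```

With the defaults, a 2500 bp reference yields three windows, and each consecutive pair shares 180 bases.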

 

PGA_CLEANUP.PL: THIS PROGRAM CREATES TWO FILES: [GENE].NNXXTYPE.OUT AND [GENE].CLEANUP.OUT. THESE FILES ALLOW THE ANALYSTS A QUICK METHOD OF DETERMINING WHICH PRODUCTS NEED CLEANUP BY LISTING THE NUMBER OF SAMPLES MISSING FOR EACH SITE IN EACH PRODUCT.

COMMAND LINE PARAMETERS:

·        -pb [gene.prettybase.txt]

·        -db [gene.database.txt]

·        -layout [/c16/pga.layout] OR [/ff1/EGP/egp.layout]

 

PGA_FINISH_GENE.PL: THIS PROGRAM AUTOMATES THE FORMATTING OF THE OUTPUT FILES FOR PGA GENE COMPLETION AND SHOULD BE RUN IN THE STATS_DIR.

COMMAND LINE PARAMETERS:

·        -lab [lab symbol]

·        -hugo [hugo symbol]

·        -drive [drive the data is located on - for example c16]

 

PGA_GENE_ANALYSIS: THIS PROGRAM COMBINES POLYPHRED2DB, SQ4POLYPHRED AND SITE_STATS, USING THE SAME INPUT AS POLYPHRED2DB.

COMMAND LINE PARAMETERS:

·        -polyphred [name of polyphred output file] - remember to rerun polyphred to get the latest output information

·        -output [prefix] - this is appended to all created files - this parameter must begin with a letter and not a number

·        -database [name of postgres database to use]

·        -hugo [hugo symbol]

·        -table [table name] - usually the same as the output prefix - this parameter must begin with a letter and not a number

·        -sequence [sequence] in fasta format to map sites to - typically this is the genbank reference sequence

·        -population [file containing specific samples listed in a column] - /c16/pga.samples for the pga.

·        -query [specify postgres query options]

·        -sample (optional) regular expression used to separate sample name from file name - example '(\w{9})(\w{4})' - this will vary depending on the format of the filenames - the example {5} is for a two letter gene abbreviation and a three digit (alpha-numeric) sample name.

 

PGA_MOVECHROMATS.PL: FOR PGA GENES, THIS PROGRAM MOVES NEW CHROMATS TO THE PROPER GLOBAL DIRECTORIES. THIS PROGRAM MUST BE RUN IN THE NEW_CHROMATS DIRECTORY OF THE APPROPRIATE GENE AFTER DATA HAS BEEN TRANSFERRED. THE PROGRAM WILL PERFORM THE FOLLOWING TASKS:

1.     Check the quality of the reads and write the quality data to a file in the ../quality_dir directory

2.     Strip off the 'ab1' extension and add the correct iteration number

3.     Move the chromats to the proper directories, such as ../chromat_dir, ../bad_chromats, and ../chimp_dir/chromat_dir

 

PGA_READ_STATS.PL: ASSESSES THE QUALITIES OF THE CHROMATS. THE RESULTS ARE SORTED IN THE OUTPUT.

 

PHASE_WRAP.PL: THIS PROGRAM IS USED TO RUN PHASE ON A GIVEN GENE.

OPTIONS: -polyphred and -output are required

·        -pb [name of prettybase output file]

·        -lab [lab symbol]



PHREDPHRAP: ASSEMBLY PROGRAM THAT RUNS BOTH PHRED AND PHRAP ON GIVEN DATA.

 

PHREDPHRAP.LONGREADS: ASSEMBLY PROGRAM THAT RUNS BOTH PHRED AND PHRAP ON GIVEN DATA FOR GENES LONGER THAN 56KB.

 

PHREDPHRAPPOLY: SAME AS PHREDPHRAP, BUT ALSO RUNS POLYPHRED TO APPLY TAGS.

 

POLYTRACT.PL: THIS PROGRAM IDENTIFIES POLY N STRETCHES IN A FASTA FILE - GOOD FOR IDENTIFYING STRETCHES WHERE THE SEQUENCING REACTION MIGHT FAIL, WHICH ALLOWS ONE TO PICK PCR PRIMERS AROUND THESE REGIONS. CURRENTLY SET AT A DEFAULT OF N=11 BASES.

COMMAND LINE EXAMPLE:

·        polytract.pl [fasta reference filename]
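
A tract search of this kind can be sketched in Python with a regular expression. This is an illustration, not the polytract.pl source: it assumes a "poly N stretch" means a run of one repeated base (A, C, G or T) at least n bases long, and the function name poly_tracts is hypothetical.

```python
import re


def poly_tracts(seq, n=11):
    """Return (begin, end, base) for each single-base run of at least n bases.

    Positions are 1-based and inclusive.
    """
    seq = "".join(seq.split()).upper()
    # One base followed by at least n-1 repeats of that same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (n - 1))
    return [(m.start() + 1, m.end(), m.group(1)) for m in pattern.finditer(seq)]


print(poly_tracts("ACGT" + "A" * 11 + "CG" + "T" * 10))
```

In the example only the run of eleven A's is reported; the run of ten T's falls below the default threshold.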

 

POLYPHRED: THIS PROGRAM IS USED TO FIND POTENTIAL SNP SITES. THE SITES MUST BE CONFIRMED BY AN ANALYST. FOR FURTHER INFORMATION, SEE THE POLYPHRED DOCUMENTATION.

 

POLYPHRED2DB: CONVERTS THE POLYPHRED OUTPUT FILE TO DATABASE FORMAT, WHICH MEANS THAT YOU MUST BE CAREFUL TO PIPE THE RESULTS OF A POLYPHRED RUN TO ANOTHER FILE.

OPTIONS: -polyphred and -output are required

·        -polyphred [name of polyphred output file] - remember to rerun polyphred to get the latest output information

·        -output [prefix] - this is appended to all created files - THIS MUST BEGIN WITH A LETTER AND NOT A NUMBER

·        -database [name of postgres database to use]

·        -table [table name] - usually the same as the output prefix - THIS MUST BEGIN WITH A LETTER AND NOT A NUMBER

·        -sequence [sequence] in fasta format to map sites to - typically this is the genbank reference sequence

·        -sample regular expression used to separate sample name from file name - example '(\w{5})(\w{3})' - this will vary depending on the format of the filenames - the example {5} is for a two letter gene abbreviation and a three digit (alpha-numeric) sample name.

 

POPULATEFEATUREANDFINALREF.PL: THIS PROGRAM IS USED AFTER GENBANK RELEASES OUR SUBMISSION. TWO TABLES NEED TO BE POPULATED (FINAL_REF_SEQ AND FEATURE_SITE), AND THE TABLE GENOMIC_SEG NEEDS TO BE UPDATED WITH THE GENBANK ACCESSION NUMBER.

COMMAND LINE PARAMETERS:

·        -genbank_id [genbank accession number]

·        -lab_symbol [five character gene code used by Nickerson lab]

 

POPULATEGENOMICSEG.PL: STEP TWO IN THE INITIAL SETUP OF A GENE IN THE ORACLE DATABASE (STEP ONE IS POPULATEINITINFO.PL). FOR MORE INFORMATION SEE POPULATEINITINFO.PL.

COMMAND LINE PARAMETERS:

·        -locus_id [locus accession number]

·        -init_ref [gene.reference.fasta]

·        -lab_symbol [five character gene code used by Nickerson lab]

·        -type [gene or intergenic]

 

POPULATEINITINFO.PL: INITIAL STEP IN SETTING UP A GENE IN THE ORACLE DATABASE. THIS PROGRAM, USED IN CONJUNCTION WITH POPULATEGENOMICSEG.PL, POPULATES FOUR TABLES (CURATED_SEQ, INIT_REF_SEQUENCE, GENOMIC_SEG AND ALIAS). SEE POPULATEGENOMICSEG.PL FOR MORE INFORMATION.

COMMAND LINE PARAMETERS:

·        -source [NT_xxxx.fasta] - the original genomic sequence

·        -init_ref [gene.reference.fasta]

·        -xm [xm file] - cross_match file between NT_xxxx.fasta and gene.reference.fasta

 

POPULATEPSQLDATA.PL: THIS PROGRAM IS DESIGNED TO LOAD THE POSTGRES DATABASE (WEB-PGA) ON DROOG WITH THE DATA FROM COMPLETED PGA GENES. THIS PROGRAM MUST BE RUN ON DROOG SINCE NONE OF THE POSTGRES CLIENTS ARE CURRENTLY SET UP FOR CONNECTIVITY TO DROOG; IT SHOULD BE RUN AT THE SAME TIME AS FINALIZE_WEB_DATA.PL WHEN THE GENBANK FILE IS PUBLISHED SINCE IT NEEDS TO ACCESS THE GENBANK PAGE VIA THE WEB FOR CERTAIN DATA.

COMMAND LINE PARAMETERS:

·        -hugo - hugo name for the gene

·        -lab_symbol - lab symbol for the gene

·        -genbank_id - the Genbank assigned accession number



PSTILL: CONVERTS A POSTSCRIPT (PS) FILE INTO A PDF FILE.

 

QUERYCHROMATSTATUS.PL: THIS PROGRAM IS USED TO QUERY THE STATUS OF ANY CHROMAT IN THE DATABASE. IT CAN EITHER BE USED TO QUERY THE STATUS OF ALL CHROMATS WITHIN A SPECIFIED GENE, THE STATUS OF ALL CHROMATS WITHIN A SPECIFIED PRODUCT (OR PRIMER) WITHIN A GENE, THE STATUS OF A PARTICULAR SAMPLE OR ALL CHROMATS ASSOCIATED WITH A PARTICULAR STATUS WITHIN A GENE.

COMMAND LINE PARAMETERS:

·        -lab_symbol [required]

·        -database [required]

·        -product - must include any preceding 0's (i.e. 0112 or 0010)

·        -sample [optional]

·        -status [optional] - choices are used, bad or saved

 

QUERYCONSENSUSSEQ.PL: AFTER ALL THE "ANALYSIS TABLES" ARE POPULATED, YOU CAN QUERY OUT THE CONTIG CONSENSUS SEQUENCE AND GENOTYPE. THE SEQUENCE IS AUTOMATICALLY OUTPUT TO LAB_SYMBOL.CONTIGS.

COMMAND LINE PARAMETERS:

·        -polyphred [polyphred output file]

·        -lab_symbol [five character gene code used by Nickerson lab]

 

QUERYGENOTYPE.PL: THE RESULTS ARE OUTPUT TO LAB_SYMBOL_DATABASE.TXT

COMMAND LINE PARAMETERS: -sample is optional.

·        -polyphred [polyphred output file]

·        -lab_symbol [five character gene code used by Nickerson lab]

·        -sample - regular expression used to separate sample and file name in the chromat name [the default is set to '(\w{9})(\w{4})' which is the format used by the PGA]

 

RC: THIS PROGRAM REVERSES AND COMPLEMENTS A CASE-INSENSITIVE SEQUENCE ENTERED ON THE COMMAND LINE, REPORTING THE RESULTS TO STANDARD OUTPUT.

COMMAND LINE EXAMPLE:

·        rc <sequence>
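
Reverse complementing is simple enough to sketch in a few lines of Python. This is an illustration, not the rc source; the function name reverse_complement is hypothetical, and the sketch preserves the case of the input and passes any character other than A, C, G, T or N through unchanged.

```python
# Translation table for both upper- and lower-case bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AAGC"))
```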

 

RE: FINDS WHICH POLYMORPHISMS ARE RESTRICTION ENZYME SITES. THIS PROGRAM NEEDS A REFERENCE SEQUENCE IN FASTA FORMAT ALONG WITH ALLELES WHICH IS CREATED BY SQ4POLYPHRED.

COMMAND LINE PARAMETERS:

·        -sequence [name of reference sequence in fasta format]

·        -alleles [alleles filename] - the alleles file is created by sq4polyphred

 

REFCOMP: REFCOMP WAS DESIGNED TO ANALYZE SEQUENCING TRACES WHICH CONTAIN DATA FROM STRICTLY HOMOZYGOUS SAMPLES (E.G. CLONED DNA, MITOCHONDRIAL DNA, ETC.). THIS DATA REPRESENTS A SPECIAL CASE WHICH CAN BE ANALYZED FOR MISMATCHES WITH A KNOWN SEQUENCE. REFCOMP WILL DETERMINE THE HIGH QUALITY POSITIONS WITHIN AN ASSEMBLED DNA CONTIG AND PRODUCE A REPORT LISTING THESE SITES. IN THE NICKERSON LAB, REFCOMP IS USED TO ANALYZE MISMATCHES BETWEEN HUMAN AND CHIMP SEQUENCES. FOR MORE INFORMATION, SEE REFCOMP DOCUMENTATION.

 

REGIONSNOTSCANNED.PL: THIS PROGRAM AUTOMATICALLY COMPUTES THE REGIONS THAT ARE NOT SCANNED FOR VARIATION.  SINCE -MIN_COUNT AND -MIN_LENGTH ARE PARAMETERS ENTERED ON THE COMMAND LINE, ANALYSTS SHOULD USE CONSED TO VISUALLY VERIFY THAT THE APPROPRIATE REGIONS WERE SELECTED.

COMMAND LINE PARAMETERS:

·        -polyphred [betaPolyphred output file]

·        -min_count - the minimal number of chromats to be considered for a region to be 'adequately scanned'.

·        -min_length - the minimal length for a region to be considered 'adequately scanned' (less than this length means 'region not scanned for variation').
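
The way -min_count and -min_length work together can be sketched in Python. The sketch assumes the per-base chromat coverage has already been worked out as a list of counts; how regionsNotScanned.pl derives coverage from the polyphred output is not covered here, and the function name regions_not_scanned is hypothetical. A stretch counts as adequately scanned only if every base has at least min_count chromats and the stretch is at least min_length long; everything else is reported.

```python
def regions_not_scanned(coverage, min_count, min_length):
    """Return (begin, end) regions, 1-based inclusive, that are not adequately scanned."""
    n = len(coverage)
    scanned = [False] * n
    i = 0
    while i < n:
        if coverage[i] >= min_count:
            j = i
            while j < n and coverage[j] >= min_count:
                j += 1
            # Keep the covered stretch only if it is long enough.
            if j - i >= min_length:
                for k in range(i, j):
                    scanned[k] = True
            i = j
        else:
            i += 1
    # Merge consecutive unscanned positions into regions.
    regions, start = [], None
    for pos, ok in enumerate(scanned, start=1):
        if not ok and start is None:
            start = pos
        elif ok and start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, n))
    return regions


print(regions_not_scanned([0, 0, 5, 5, 5, 5, 1, 5, 5, 0], min_count=3, min_length=3))
```

In the example, the two well-covered bases at positions 8 and 9 are still reported as not scanned because the stretch is shorter than min_length.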

 

REPEAT_MASKER: LOOKS FOR REPEATS WITHIN ONE FILE. USING THE -XM FLAG CREATES A CROSS_MATCH FILE AS OUTPUT.

COMMAND LINE EXAMPLE:

·        RepeatMasker -xm [filename]

 

RMPOLYPHREDEXP.PL: THIS PROGRAM IS USED TO DELETE ALL DATA FROM THE DATABASE IF THE DATABASE WAS POPULATED WITH 'FINAL' DATA AND UPDATES NEED TO BE MADE.

COMMAND LINE PARAMETERS:

·        -polyphred OR -poly_exp_id [either the polyphred output file OR the polyphred experiment id]

·        -database [egp OR pga]

·        -lab_symbol

·        -pwd [password for accessing the database]

 

SETUP_STD_DIR.PL: SETS UP ALL THE STANDARD DIRECTORIES FOR GENE ANALYSIS WHEN A GENE DIRECTORY IS SET UP.

SITE_STATS.PL: THIS PROGRAM CALCULATES THE TOTAL NN'S, XX'S AND COMPLETE GENOTYPES FOR EACH SITE ALONG WITH THEIR PERCENTAGES OUT OF THE TOTAL NUMBER OF SAMPLES. A PRETTYBASE FILE MUST BE SPECIFIED ON THE COMMAND LINE, ALONG WITH A MAPTOREF FILE, A DATABASE AND AN OUTPUT FILENAME.

COMMAND LINE PARAMETERS:

·        -pb [name of prettybase file]

·        -db [postgres database name]

·        -maps [mapToRef file]

·        -out [output filename]
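
The per-site tally can be sketched in Python as follows. This is an illustration, not the site_stats.pl source: it assumes the prettybase rows have already been parsed into (site, sample, allele1, allele2) tuples, that a genotype written as two N's or two X's is counted under NN or XX, and that every other genotype counts as complete. The function name site_stats and the sample names are made up.

```python
from collections import defaultdict


def site_stats(genotypes, total_samples):
    """Return {site: {"NN": (count, pct), "XX": (count, pct), "complete": (count, pct)}}."""
    counts = defaultdict(lambda: {"NN": 0, "XX": 0, "complete": 0})
    for site, _sample, a1, a2 in genotypes:
        call = (a1 + a2).upper()
        key = call if call in ("NN", "XX") else "complete"
        counts[site][key] += 1
    # Percentages are taken out of the total number of samples.
    return {
        site: {k: (v, 100.0 * v / total_samples) for k, v in c.items()}
        for site, c in counts.items()
    }


rows = [
    (101, "D001", "A", "G"),
    (101, "D002", "N", "N"),
    (101, "D003", "X", "X"),
    (101, "D004", "A", "A"),
]
print(site_stats(rows, total_samples=4))
```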

 

SQ4POLYPHRED: WRAPPER THAT PERFORMS STATISTICAL ANALYSIS, STORING THE RESULTS IN VARIOUS OUTPUT FILES.

OPTIONS: -database and -table are required

·        -database [name of postgres database to use]

·        -table [tablename]

·        -population [specific samples]

·        -query [specify postgres query options]

 

TAG_REFERENCE_SEQUENCE.PL: THIS PROGRAM PLACES NUMBERING TAGS ON THE REFERENCE SEQUENCE TO ALLOW SITES TO BE NOTED BY THE REFERENCE SEQUENCE NUMBER RATHER THAN THE CONSENSUS SEQUENCE NUMBER SINCE THE CONSENSUS SEQUENCE MAY CHANGE ON SUBSEQUENT RUNS. A REFERENCE PHD FILE AND A REFERENCE CHROMAT FILE MUST BE PRESENT - THE NEWLY TAGGED REFERENCE PHD.FILE IS CALLED 'TEMP' AND MUST THEN BE PLACED IN THE PHD_DIR AND GIVEN THE ACTUAL REFERENCE SEQUENCE PHD FILENAME

COMMAND LINE EXAMPLE:

·        tag_reference_sequence.pl [reference.phd filename]

 

TBL2ASN: THIS IS A COMMAND LINE PROGRAM USED TO GENERATE THE GENBANK FILE. IT REQUIRES A TEMPLATE FILE CONTAINING STANDARD SUBMISSION INFORMATION, THE GENE FEATURES TABLE [HUGO_NAME.TBL] CREATED BY MAKE_GENBANKTABLE.PL, AND A FASTA FILE OF THE GENE SEQUENCE YOU WISH TO SUBMIT [HUGO_NAME.FSA]. THIS PROGRAM REQUIRES THAT THE TABLE NAME AND THE FASTA FILE HAVE THE SAME BASENAME [HUGO_NAME].

COMMAND LINE PARAMETERS:

·        -t [template_filename] = /usr/local/genome/src/sequin_*/pga_template.sbt

·        -p [path to input files]

 

TRANSLATE: THIS PROGRAM MAKES A PROTEIN TRANSLATION OF THE CDNA FILE.

COMMAND LINE PARAMETERS:

·        -cdna [gene.cdna.fasta]

·        -pep [gene.protein.fasta]
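
A cDNA-to-protein translation can be sketched in Python as follows. This is an illustration, not the translate source: it assumes the standard genetic code, a reading frame that begins at the first base, and that translation stops at the first stop codon. The function name translate_cdna is hypothetical, and codons containing ambiguous bases are written as X.

```python
# Standard genetic code, built from the usual TCAG ordering of the codon table.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)


def translate_cdna(cdna):
    """Translate a cDNA string in frame 1, stopping at the first stop codon."""
    seq = "".join(cdna.split()).upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cdna("ATGGCCTAAGGG"))
```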

 

TRANSLATESNP.PL: THIS PROGRAM COMBINES TRANSLATE AND TRANSLATECSNP, GIVING <GENE>.PROTEIN.FASTA AND <GENE>.CSNPS.TXT AS OUTPUT. TRANSLATE_NEW.PL IS INCORPORATED INTO FINISH_GENE.PL, WHICH IS USED FOR COMPLETING THE GENE DATA.

COMMAND LINE PARAMETERS:

·        -xmfile [gene.exons.xm]

·        -cdna [gene.cdna.fasta]

·        -alleles [gene.alleles.txt]

·        -lab [lab_symbol]

·        -startpos [start position - only use if necessary to force the starting position]

 

UPDATECHROMATSTATUS.PL: THIS PROGRAM UPDATES THE CHROMAT STATUS IN THE ORACLE DATABASE. IT IS NECESSARY TO UPDATE THE STATUS TO EITHER BAD OR SAVED DEPENDING ON THE SITUATION. FOR EXAMPLE, IF CHROMATS COULD NOT BE ADDED USING ADDNEWREADS, THEY WOULD NEED TO BE UPDATED TO STATUS 'BAD', WHEREAS IF A PRIMER WAS FOUND TO BE ALLELE SPECIFIC, THE STATUS OF THE CORRESPONDING CHROMATS WOULD NEED TO BE UPDATED TO 'SAVED'. THE PROGRAM ALSO MOVES THE ASSOCIATED CHROMATS TO THE CORRECT DIRECTORY AND REMOVES THE CORRESPONDING PHD AND POLY FILES.

COMMAND LINE PARAMETERS:

·        -status [bad or saved]

·        -chromat [file containing list of chromats to be updated] - this file should contain a list of chromats to be modified

·        -database - specify the database to be used (pgadev or egpdev)

·        -pwd - database password

 

VG: THIS PROGRAM MUST BE USED WITH A PRETTYBASE FILE AND CREATES A "PICTURE" OF POLYMORPHIC SITES. THE COLORS REPRESENT HOMOZYGOUS SITES, HETEROZYGOUS SITES, AND SITES THAT HAVE CONFLICTS OR UNRESOLVED GENOTYPE MISMATCHES. THIS PROGRAM CAN BE RUN WITH A CONSED FLAG IN ORDER TO HAVE AN INTERACTIVE PROGRAM - CLICKING ON A BOX WILL BRING UP THE MATCHING SITE IN CONSED.

OPTIONS: -file is required

·        -file [prettybase filename] - the prettybase file is created by sq4polyphred

·        -ps [filename.ps] - insert the name of the .ps file to be created

·        -ola_table [ola_table filename]

·        -database [psql database] - specify which psql database to use

·        -font [font] - specify the font

·        -row [row]

·        -table [table] - specify the postgres table to use

·        -rare [value] - filters out sites that have the specified number (value) of rare alleles - homozygous alleles are counted twice because there are technically TWO rare alleles at such a site, while heterozygous sites are only counted once (one rare allele and one common allele present)

·        -regex [regular expression] - specify the regular expression

·        -consed [socket value - NOT 5432] - the user must supply a socket value for consed to use, but this value cannot be 5432, which is used by postgres (it is recommended to choose a high value to limit conflicts with other processes)

·        -unit [unit]

·        -col [column]
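
The counting rule described for -rare (a homozygous rare genotype contributes two rare alleles, a heterozygous one contributes one) can be sketched in Python. This is an illustration only, not the vg source: it assumes the genotypes for one site are given as (allele1, allele2) pairs, treats the most frequent allele as the common one, and skips N and X calls. The function name rare_allele_count is hypothetical, and how vg applies the threshold to filter sites is not shown.

```python
from collections import Counter


def rare_allele_count(genotypes):
    """Count the rare alleles at one site from a list of (allele1, allele2) pairs."""
    alleles = Counter(
        a for pair in genotypes for a in pair if a.upper() not in ("N", "X")
    )
    if not alleles:
        return 0
    _common, common_n = alleles.most_common(1)[0]
    # Every allele that is not the common one is rare; a homozygous
    # rare genotype therefore adds two, a heterozygous one adds one.
    return sum(alleles.values()) - common_n


site = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A"), ("N", "N")]
print(rare_allele_count(site))
```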

 

VG_ORACLE: THIS IS MODIFIED FROM THE CURRENT VG PROGRAM (FOR FURTHER INFORMATION, SEE VG). IT USES THE SAME FLAGS AS THE ORIGINAL VG PROGRAM, HOWEVER THE FOLLOWING TWO FLAGS NEED TO BE ENTERED DIFFERENTLY:

OPTIONS WHICH DIFFER FROM ORIGINAL VG:

·        -database - this will be the oracle database name (SID) [pgadev for PGA development database]

·        -table [lab_symbol]

 

VG2PB: THIS PROGRAM CREATES A PRETTYBASE FILE FROM A VG PLOT.

COMMAND LINE PARAMETERS:

·        -file [prettybase filename] - the prettybase file is created by sq4polyphred

·        -rare [value] - filters out sites that have the specified number (value) of rare alleles - homozygous alleles are counted twice because there are technically TWO rare alleles at such a site, while heterozygous sites are only counted once (one rare allele and one common allele present)

 

VP: WRAPPER WHICH RUNS PHREDPHRAPPOLY AND BRINGS UP CONSED. IT IS IMPORTANT TO KEEP IN MIND THAT THIS PROGRAM CREATES LINKS WITH THE GLOBAL DIRECTORY ALLOWING FOR LOCAL ASSEMBLIES - THEREFORE IT MUST BE RUN IN ANALYSIS_DIR/EDIT_DIR. IN ADDITION, THERE ARE FLAGS WHICH ALLOW THE USER TO PULL IN ONLY SPECIFIED DATA INCLUDING ANYTHING WITH A SPECIFIED DATE OR ANYTHING FROM A SPECIFIED PRIMER. THIS PROGRAM ALSO COPIES ANY LOCAL CHANGES BACK UP TO THE GLOBAL DIRECTORIES.

OPTIONS: -project is required

·        -project [project] - usually this is the gene name or a portion of the gene name

·        -mode [mode]

·        -quality [quality] - specify the quality desired for running polyphred

·        -data [data] - this allows for local assemblies of part of the data

·        -rank [rank] - specify the rank desired for running polyphred

·        -date [date] - this allows the option to run the data processed on a specific date

·        -ratio [ratio] - specify the ratio desired for running polyphred

·        -minscore [minscore] - minscore value desired for running phredPhrap

·        -background [background] - specify the background desired for running polyphred

·        -minmatch [minmatch] - minmatch value desired for running phredPhrap

·        -days [number of days] - this allows the user to select the data processed on a certain day - rather than entering a date (as above) the user enters a number specifying the days before today (i.e. today = 0, yesterday = 1)

 

VS: PARSES THROUGH POLYPHRED OUTPUT FILES AND GIVES THE AVERAGE LENGTHS OF READS. THIS IS PRIMARILY USED TO GENERATE CLEAN-UP LISTS ON READS THAT ARE NOT A MINIMUM LENGTH (250-300 BP). NOTE - PIPE THE OUTPUT TO A FILE

COMMAND LINE EXAMPLE:

·        vs -polyphred [polyphred output file]

 

XFR2CONS: THIS PROGRAM WAS DESIGNED TO COPY ANY TAGS THAT ARE ON THE READS TO THE CONSENSUS SEQUENCE. POLYPHRED USED TO ONLY RETAIN THE TAGS THAT WERE ON THE READS, BUT NOW RETAINS THE TAGS ON THE CONSENSUS SEQUENCE - EASIER TO VIEW BUT THERE WILL BE A "CHANGE-OVER" PERIOD SINCE SOME FILES NOW HAVE THE TAGS ON THE READS WHILE OTHERS HAVE THEM ON THE CONSENSUS.

COMMAND LINE PARAMETERS:

·        -polyphred [polyphred output file]

·        -navigator ["navigate by tags" output filename] - this file is created in consed - simply "navigate by tags", choosing 'realPolymorphism' tags and save the produced list

 

XM2TAGS.PL: THIS PROGRAM CAN BE USED TO TAG THE REFERENCE SEQUENCE WITH ANY DESIRED INFORMATION OBTAINED BY RUNNING CROSS_MATCH - THIS IS PARTICULARLY USEFUL FOR TAGGING THE REFERENCE SEQUENCE WITH PRIMER OR EXON LOCATIONS. A TAG TYPE MUST BE SPECIFIED. THE PROGRAM USES A .XM FILE AS INPUT, WHICH IS CREATED WHEN CROSS_MATCH IS RUN AND THE OUTPUT IS REDIRECTED TO [FILENAME.XM]. IF THE -TAG FLAG IS SET TO PRIMER, THE PROGRAM CHECKS TO MAKE SURE THAT EACH PRIMER IS ONLY REPRESENTED ONCE IN THE XM FILE. THE PROGRAM CREATES AN OUTPUT FILE NAMED '[$ARG{-TAG}]_TAGS.TXT' WHICH MUST BE CONCATENATED ONTO THE DESIRED .PHD REFERENCE FILE:

COMMAND LINE EXAMPLE:

·        xmtotags.pl -xm [xm filename] -tag [tag name, such as primer]

·        cat temp.file >> [reference phd.filename]