The Common Programs for Data Analysis

(here is a link to the OLD Analysis Protocol)

In new_chromats

It moves chromats out of the <new_chromats> directory, populates the pga/egp databases, iterates all the chromats through the standard directories under a gene, and creates quality report files under <quality_dir>. Requires a samples list.

In edit_dir


It creates an ace file from a phd file of the reference sequence in <phd_dir>. This can be used for the initial generation of an assembly.

(Option: phredPhrap) --- old method, not advised

2) Consed

Add new reads.

Remove reads.


3) Polyphred

standard quality threshold: 25
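As a reminder of what a threshold of 25 means, here is a minimal sketch. It assumes the threshold is a Phred-scaled base quality (the protocol does not say which scale is used) and that a base passes when its quality is at or above the threshold. The function names are hypothetical and are not part of Polyphred.

```python
def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def passes_threshold(quality: float, threshold: float = 25) -> bool:
    """Assumed rule: a base passes when its quality is at or above the threshold."""
    return quality >= threshold


if __name__ == "__main__":
    for q in (20, 25, 30):
        print(q, passes_threshold(q), phred_to_error_probability(q))
```

Under this reading, a quality of 25 corresponds to roughly a 0.3% chance that the base call is wrong.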


It populates the pgatmp/egptmp databases (or pga/egp for the final analysis) with all the analysis information we need.


It extracts useful analysis information from the databases.

6) betavg2

To view the prettybase, resolve conflicts, and do some sorting and computation.
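For a quick look at a prettybase file outside of betavg2, the kind of per-site computation mentioned above can be sketched as follows. This is only an illustration: it assumes each prettybase line holds four whitespace-separated fields (site position, sample ID, allele 1, allele 2), and the function name is hypothetical. Missing-data codes are counted like any other allele.

```python
from collections import Counter, defaultdict


def allele_counts(prettybase_lines):
    """Count alleles per site from lines assumed to be: position, sample, allele1, allele2."""
    counts = defaultdict(Counter)
    for line in prettybase_lines:
        fields = line.split()
        # Skip anything that does not look like a four-column genotype line.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        position, _sample, allele1, allele2 = fields
        counts[int(position)].update([allele1, allele2])
    # Return the sites sorted by position.
    return {pos: dict(c) for pos, c in sorted(counts.items())}


if __name__ == "__main__":
    example = ["120\tD001\tA\tG", "120\tD002\tA\tA", "95\tD001\tC\tC"]
    for pos, alleles in allele_counts(example).items():
        print(pos, alleles)
```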

7) Fix the final consensus sequence (VERY IMPORTANT)

Before you go on to finish a gene, it is very important to fix the consensus sequence. That means you have to compare the initial reference sequence with your final consensus sequence. If the allele at a non-SNP site disagrees with the allele in the initial reference, and our data is of good quality at that site, please fix the consensus sequence with the allele from our data.
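The comparison itself can be sketched as below. This is a minimal illustration, not one of the lab programs: it assumes the reference and consensus have already been aligned to the same length and that the SNP positions are known as 1-based coordinates. The function name is hypothetical, and judging data quality at each reported site is still up to the analyst.

```python
from typing import List, Set, Tuple


def non_snp_mismatches(reference: str, consensus: str,
                       snp_sites: Set[int]) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, consensus base) for mismatches outside SNP sites."""
    if len(reference) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = []
    pairs = zip(reference.upper(), consensus.upper())
    for pos, (ref_base, con_base) in enumerate(pairs, start=1):
        if pos in snp_sites:
            continue  # disagreement is expected at known SNP sites
        if ref_base != con_base:
            mismatches.append((pos, ref_base, con_base))
    return mismatches


if __name__ == "__main__":
    # Position 7 is a known SNP, so only position 3 is reported.
    print(non_snp_mismatches("ACGTACGT", "ACCTACAT", {7}))
```

Each reported position is a candidate for review in the consensus.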

8) Run <> and <> to the final database (pga/egp)

In fasta_dir

1) Fix your gene.cdna.fasta against your gene.reference.fasta. After fixing the cdna file, run <> to make sure that the cdna can be translated correctly.
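For reference, the kind of check meant by "translated correctly" can be sketched as follows. This is not the lab program; it is a minimal illustration that assumes the standard genetic code, a cDNA that starts at ATG and ends at a stop codon, and a plain sequence string (FASTA header already removed). The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cdna: str) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    seq = cdna.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def check_cdna(cdna: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    protein = translate(cdna)
    if len(cdna) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cdna.upper().startswith("ATG"):
        problems.append("does not start with ATG")
    if not protein.endswith("*"):
        problems.append("does not end with a stop codon")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    if "X" in protein:
        problems.append("codon with ambiguous base")
    return problems


if __name__ == "__main__":
    print(translate("ATGGCCTAA"), check_cdna("ATGGCCTAA"))
```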

In stats_dir


It gathers all the important files from different directories into <stats_dir> and does some final processing.

2) and tbl2asn

To prepare tables for genbank submission. You should feed the <gene.repeats.xm> and the output from <> into <>.

3) sequin

To annotate the gene, CDS, mRNA and misc features for the genbank file. You should save the final file as hugo.genbank.sqn and export a text file as hugo.genbank.txt.

After genbank submission:

PGA analysts

Analyze the chimp data using <>, which will write polymorphism confirmed tags at all variation sites.
[-t]  “c” will write polymorphismConfirmed tags

[-pb] prettybase file of standard pga gene

[-lab_symbol] gene symbol

[-genbank] genbank file

Because the genbank.txt file you exported doesn't have any value in the VERSION field, you have to manually edit the genbank.txt file to put “tmp” in the VERSION field in order for this program to use it properly.
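If you prefer not to edit the file by hand, the VERSION fix can be sketched as below. This is a minimal illustration under two assumptions: the exported file contains a VERSION line with nothing after the keyword, and the value starts at column 13 as in the usual GenBank flat-file layout. The function name is hypothetical.

```python
def fill_empty_version(genbank_text: str, placeholder: str = "tmp") -> str:
    """Put a placeholder into any VERSION line that has no value; leave other lines alone."""
    fixed = []
    for line in genbank_text.splitlines():
        if line.startswith("VERSION") and line[len("VERSION"):].strip() == "":
            # Pad the keyword to 12 columns so the value starts at column 13.
            line = "VERSION".ljust(12) + placeholder
        fixed.append(line)
    return "\n".join(fixed) + "\n"


if __name__ == "__main__":
    sample = "LOCUS       example\nVERSION\nKEYWORDS    .\n"
    print(fill_empty_version(sample), end="")
```

To apply it to a file, read genbank.txt into a string, pass it through the function, and write the result to a new file so the original export is kept.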

It will create ldSelect output files (gene.genotype.ED.txt, gene.genotype.AD.txt, gene.ED.clusters.txt, gene.AD.clusters.txt) for different populations.


It will generate gene.csnps.txt, run SIFT and polyPhen predictions for nonsynonymous SNPs, and generate the SIFT and polyPhen prediction files (gene.siftPredict.txt, gene.pph.out, gene.pph-sift.txt).


It generates all the haplotype results. (Mark knows the details.)


It generates all the blat stuff for web_publication. (Josh knows the details.)

5) Transfer all the files to the web server directory on droog </droog/httpd/html/pga/data/hugo>. Paul will take over from there to do the web publication.

6) Download an official genbank text file into </droog/httpd/html/pga/hugo/> after genbank releases the publication. Then run the following program:

[-genbank] the official genbank file

[-lab_symbol] 5-letter lab symbol for a gene

[-database] pga or egp

[-pwd] password for pga/egp database

It will record the genbank accession number, the features of the gene, and the final sequence in the pga database. Our web server also depends on the records in the pga database.

IT IS VERY IMPORTANT TO DO THIS STEP. (Paul will not handle this part any more).


When you get a genbank accession number, please run this program to make the files for dbsnp submission.

Updates on the submitted genes

If any modifications are made to the analysis of a submitted gene, you will in general have to go through all the steps for finishing a gene again in order to update our website as well as genbank and dbsnp.

Misc. Programs

      If you just want to get a prettybase file quickly from a polyphred experiment, this program will generate a prettybase directly from the polyphred output file without talking to the database.

It will query out all the pcr and sequencing work done for a gene. and

      i) If you populate the pgatmp or egptmp database with the most recent polyphred experiment, <> will output two files for you, gene.NNtype.out and gene.cleanup.out. The two files may be useful for making a cleanup list.

      ii) You should also run <> to make sure all the important regions are covered. The default values for PGA submission are < -min_count 32 -min_length 100 >

      It creates tags (primer or exon) from a corresponding cross_match output file.

    It generates a list of the chromats that are new compared with an existing ace file.

      When you want to remove some chromats from an assembly, use this program to update the status of the chromats in the database.

      It deletes a polyphred experiment record from the pga/egp databases.

      When chromats are misnamed, this program will help you change all the sequence event ids after you have fixed all the other parts of the chromat names.

      It creates some files for dbsnp submission.


Original Version - QYI 20040129

Web Updates - NCH 20040201