Instructions for Gene Analysis Using the PGA/EGP Oracle Database


      (A) Initial setup of a gene in the database

            To initiate a gene in the database, the following 4 tables need to be populated first,



            The following programs can be used to populate those tables,



                  -source      ( NT_xxxx.fasta, the original genomic sequence)

                  -init_ref    ( gene.reference.fasta )

                  -xm          ( cross_match file between NT fasta and gene.reference.fasta )

                  -database    ( database name )



                  -locus_id    ( locus accession number )

                  -init_ref    ( gene.reference.fasta )

                  -lab_symbol  ( 5-letter code for our genes )

                  -type        ( 2 choices you have: gene, or intergenic )

                  -database    ( database name )
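
            The constraints on -lab_symbol and -type can be checked before a
            setup program is run. The sketch below is a hypothetical pre-flight
            check, not part of the PGA/EGP tools. The function name is invented,
            and it assumes that a "5-letter code" means exactly five
            alphanumeric characters.

```python
# Hypothetical pre-flight check; not part of the PGA/EGP tool set.
VALID_TYPES = ("gene", "intergenic")  # the two -type choices listed above


def check_setup_args(lab_symbol: str, seg_type: str) -> None:
    """Raise ValueError if a value does not match the documented constraints."""
    # Assumption: "5-letter code" means exactly five alphanumeric characters.
    if len(lab_symbol) != 5 or not lab_symbol.isalnum():
        raise ValueError(f"lab_symbol must be a 5-letter code, got {lab_symbol!r}")
    if seg_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {seg_type!r}")


check_setup_args("ABCDE", "gene")  # made-up symbol; passes silently
```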



            PGABECH web interface can be used to populate the following tables,


            Upload primers to the database through the user Web site.



      (C) Move chromats to analysts



                  -database    ( database name )

                  -chimp_list  ( /c16/chimp.list )


            It is designed to carry out the following tasks,

                  1) check the quality of all the chromats.

                  2) write the quality report into quality_dir.

                  3) populate the table CHROMAT in the database.

                  4) figure out the correct iteration number for each chromat.

                  5) move each chromat to proper directories and compress them.
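
            Task 5 above (move and compress) can be pictured with the following
            minimal sketch. It is not the actual program. The function name, the
            choice of gzip, and the destination directory layout are all
            assumptions made for illustration.

```python
import gzip
import shutil
from pathlib import Path


def move_and_compress(chromat: Path, dest_dir: Path) -> Path:
    """Gzip `chromat` into `dest_dir` and remove the original file.

    Hypothetical helper: the real directory layout and compression
    format used by the PGA/EGP program are not stated in this document.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / (chromat.name + ".gz")
    with chromat.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    chromat.unlink()  # the chromat now exists only in compressed form
    return target
```

            The original is removed only after the compressed copy has been
            written, so an interrupted run does not lose a chromat.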


      (D) Gene analysis


            The program for this step is designed to populate the following tables in the database,






                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -exp_status    ( 3 choices you have: initial, clean-up, final)

                  -database      ( database name )


            After all the "analysis tables" are populated, you can try to query

            out the contig consensus sequence, genotype, snp list and diallelic

            indel list.



                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )

                  <-sample>      ( optional for pga. Default ‘(\w{9})(\w{4})’)


            This program will query out the following:

                  Contig_consensus_seq   ( lab_symbol.contigs )

                  SNP list               ( lab_symbol.snp.list )

                  Diallelic indel list   ( lab_symbol.diallelic.list )

                  Genotypes for real sites  ( lab_symbol.database.txt )
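
            After the query has run, it can be useful to confirm that all four
            output files exist. The sketch below assumes that "lab_symbol" in
            the file names above is a placeholder for the actual 5-letter code.
            Both function names are hypothetical.

```python
from pathlib import Path


def query_outputs(lab_symbol: str) -> dict:
    """Map each query result to its expected file name."""
    # Assumption: "lab_symbol" in the documented names is replaced
    # by the actual lab symbol of the gene.
    return {
        "contig_consensus_seq": f"{lab_symbol}.contigs",
        "snp_list": f"{lab_symbol}.snp.list",
        "diallelic_indel_list": f"{lab_symbol}.diallelic.list",
        "genotypes": f"{lab_symbol}.database.txt",
    }


def missing_outputs(lab_symbol: str, directory: str = ".") -> list:
    """Return the expected output files that are not present in `directory`."""
    base = Path(directory)
    return [
        name
        for name in query_outputs(lab_symbol).values()
        if not (base / name).is_file()
    ]
```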



                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )


            This program only queries out the contig consensus sequences.

            The sequences are automatically output to lab_symbol.contigs



                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )

                  <-sample>      ( optional for pga. Default ‘(\w{9})(\w{4})’)


            This program only queries out the genotypes for the real

            SNPs and diallelic indels stored in the database.

            The results are output to lab_symbol.database.txt
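
            The default -sample pattern, (\w{9})(\w{4}), captures two groups
            from a name: nine word characters followed by four. The sketch
            below shows how the pattern splits a name. This document does not
            state what each group denotes, and the example name is made up.

```python
import re

# Default -sample pattern from the flag lists above.
SAMPLE_PATTERN = re.compile(r"(\w{9})(\w{4})")


def split_name(name: str):
    """Return the two captured groups, or None if the name does not match."""
    match = SAMPLE_PATTERN.match(name)
    return match.groups() if match else None


# "ABCDE0001S001" is an invented name used only for illustration.
print(split_name("ABCDE0001S001"))
```

            A name with fewer than 13 leading word characters does not match,
            so a different pattern has to be supplied with -sample when names
            follow another convention.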


            To get a prettybase file and some statistical analysis such as

            alleles.out, alleles.txt, freq.txt, hz.txt, hw.txt, stats.txt,

            and to update the consensus sequences with allele frequency to get

            refSeq.fasta and hugo.fsa, you can run the following programs.



                  -db            ( lab_symbol.database.txt )

                  -population    ( /c16/pga.samples )

                  > lab_symbol.prettybase.txt  ( output file )



                  -pb            ( lab_symbol.prettybase file )

                  -indel         ( lab_symbol.diallelic.list )

                  The results are automatically output to

                        ( lab_symbol.alleles.out )

                        ( lab_symbol.alleles.txt )

                        ( lab_symbol.freq.txt )

                        ( lab_symbol.hz.txt )

                        ( lab_symbol.hw.txt )



                  -pb           ( lab_symbol.prettybase.txt )

                  -out          ( lab_symbol.stats.txt )

            It calculates the percentage of genotypes called across the

            whole gene, and the completeness per site and per individual.
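
            The three completeness figures can be illustrated as follows. This
            is a sketch of the arithmetic only, not the actual program. The
            input structure (a mapping from site and individual to a genotype,
            with None for a missing call) and the function name are assumptions.

```python
from collections import defaultdict


def completeness(genotypes: dict):
    """Return (overall %, per-site %, per-individual %) of called genotypes.

    `genotypes` maps (site, individual) -> genotype string, or None when
    no genotype was called. This layout is a hypothetical stand-in for
    the prettybase input.
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    per_site = defaultdict(lambda: [0, 0])        # site -> [called, total]
    per_individual = defaultdict(lambda: [0, 0])  # individual -> [called, total]
    called = 0
    for (site, individual), call in genotypes.items():
        is_called = 1 if call is not None else 0
        called += is_called
        for tally in (per_site[site], per_individual[individual]):
            tally[0] += is_called
            tally[1] += 1

    def pct(count, total):
        return 100.0 * count / total

    return (
        pct(called, len(genotypes)),
        {site: pct(*t) for site, t in per_site.items()},
        {ind: pct(*t) for ind, t in per_individual.items()},
    )


# Invented data: two sites, two individuals, one missing call.
example = {
    (101, "P1"): "AG", (101, "P2"): None,
    (250, "P1"): "CC", (250, "P2"): "CT",
}
overall, by_site, by_individual = completeness(example)
```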



                  -navigator     ( lab_symbol.snp.list )

                  -contig        ( lab_symbol.contigs )

                  -allele        ( lab_symbol.alleles.txt )

                  -hugo          ( hugo symbol )

                  The results are automatically output to




      If you do not need the details and prefer not to follow this step-by-step

      protocol, you can run the following WRAPPER program. It populates

      all the analysis tables and queries out the standard “Nick PGA” output.

      ( It requires you to follow the “Nick PGA” standard naming conventions and

      analysis. )


                  -polyphred   ( betaPolyphred output file )

                  -database    ( database name )

                  -lab_symbol  ( 5-letter code for genes )

                  -population  ( /c16/pga.samples for pga )

                  -exp_status  ( 3 choices: initial, clean-up, final )

                  -hugo        ( hugo symbol )

                  <-sample>    (optional for pga. Default to ‘(\w{9})(\w{4})’)

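
            The flag set above can be mirrored in a small argument parser, for
            example to validate a command line before submitting a job. This is
            a sketch only: the parser name is hypothetical, and it assumes that
            every flag except -sample is required, which this document implies
            but does not state.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the wrapper flags listed above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="wrapper_sketch")
    parser.add_argument("-polyphred", required=True)   # betaPolyphred output file
    parser.add_argument("-database", required=True)    # database name
    parser.add_argument("-lab_symbol", required=True)  # 5-letter code for genes
    parser.add_argument("-population", required=True)  # e.g. /c16/pga.samples
    parser.add_argument(
        "-exp_status",
        required=True,
        choices=["initial", "clean-up", "final"],
    )
    parser.add_argument("-hugo", required=True)        # hugo symbol
    # Optional flag; the default is the pattern documented above.
    parser.add_argument("-sample", default=r"(\w{9})(\w{4})")
    return parser


# All values below are placeholders for illustration.
args = build_parser().parse_args([
    "-polyphred", "polyphred.out",
    "-database", "mydb",
    "-lab_symbol", "ABCDE",
    "-population", "/c16/pga.samples",
    "-exp_status", "initial",
    "-hugo", "SYMBOL",
])
```

            An -exp_status value outside the three listed choices makes the
            parser exit with an error message instead of starting the run.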


      To view your prettybase file you can run vg_oracle,

            (9) vg_oracle

                  This is modified from the current vg program. It still has

                  the same flags. The two flags you need to enter differently are:


                  -database      ( database name )

                  -table         ( lab_symbol )


      (E) Final stage


            After GenBank releases our submission, the following two tables need

            to be populated,

                  FINAL_REF_SEQ, FEATURE_SITE

            The table GENOMIC_SEG also needs to be updated with the GenBank

            accession number.



                  -genbank_id    ( genbank accession number )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )