Instructions for Gene Analysis Using the PGA/EGP Oracle Database

 

      (A) Initial setup of a gene in the database

            To initiate a gene in the database, the following 4 tables must be populated first:

                  CURATED_SEQ, INIT_REF_REFERENCE, GENOMIC_SEG, ALIAS.

 

            The following programs can be used to populate those tables:

 

            1) populateInitInfo.pl

                  -source      ( NT_xxxx.fasta, the original genomic sequence)

                  -init_ref    ( gene.reference.fasta )

                  -xm          ( cross_match file between NT fasta and gene.reference.fasta )

                  -database    ( database name )

 

            2) populateGenomicSeg.pl

                  -locus_id    ( locus accession number )

                  -init_ref    ( gene.reference.fasta )

                  -lab_symbol  ( 5-letter code for our genes )

                  -type        ( 2 choices: gene or intergenic )

                  -database    ( database name )
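The two setup commands above can be assembled as in the following sketch. It only builds and prints the command lines (a dry run); nothing is executed. The flag names come from the lists above. The values "gene.xm", "LOCUS_ID", "abcde" and "DBNAME" are hypothetical placeholders, and the helper name setup_commands is not part of the pipeline.

```python
import shlex


def setup_commands(nt_fasta, init_ref, xm_file, locus_id,
                   lab_symbol, seg_type, database):
    """Return the two setup commands as argument lists (nothing is run)."""
    # The instructions allow only these two values for -type.
    if seg_type not in ("gene", "intergenic"):
        raise ValueError("-type must be 'gene' or 'intergenic'")
    # The instructions describe -lab_symbol as a 5-letter code.
    if len(lab_symbol) != 5:
        raise ValueError("-lab_symbol must be a 5-letter code")
    return [
        ["populateInitInfo.pl", "-source", nt_fasta, "-init_ref", init_ref,
         "-xm", xm_file, "-database", database],
        ["populateGenomicSeg.pl", "-locus_id", locus_id,
         "-init_ref", init_ref, "-lab_symbol", lab_symbol,
         "-type", seg_type, "-database", database],
    ]


if __name__ == "__main__":
    # All values below are placeholders, not real file or database names.
    for cmd in setup_commands("NT_xxxx.fasta", "gene.reference.fasta",
                              "gene.xm", "LOCUS_ID", "abcde", "gene",
                              "DBNAME"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check that both programs receive the same -init_ref file and -database name before anything is written to the database.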

 

      (B) PGABENCH/IBENCH

            The PGABENCH web interface can be used to populate the following tables,

                  PLATE, PRIMER, AMPLICON, PCR_EVENT, SEQ_EVENT.

            Upload primers to the database through the user Web site.

 

           

      (C) Move chromats to analysts

           

            moveChromats.pl

                  -database    ( database name )

                  -chimp_list  ( /c16/chimp.list )

 

            It is designed to carry out the following tasks,

                  1) check the quality of all the chromats.

                  2) write the quality report into quality_dir.

                  3) populate the table CHROMAT in the database.

                  4) figure out the correct iteration number for each chromat.

                  5) move each chromat to its proper directory and compress it.

 

      (D) Gene analysis

 

            pAnalysisTables.pl is designed to populate the following tables in the database,

                  POLYPHRED_EXP, CONTIG, CONTIG_CONSENSUS_SEQ, CONTIG_CHROMAT,

                  POLYSITE_IN_POLYPHRED_EXP, POLYMORPHIC_SITE,

                  POLYMORPHISM, PIPE_GENO, CHROMAT_GENOTYPE.

           

            (1) pAnalysisTables.pl

                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -exp_status    ( 3 choices: initial, clean-up, final )

                  -database      ( database name )

 

            After all the "analysis tables" are populated, you can query out

            the contig consensus sequences, genotypes, SNP list, and diallelic

            indel list.

 

            (2) stdQuery.pl

                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )

                  <-sample>      ( optional for pga. Default '(\w{9})(\w{4})' )
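To see what the default -sample pattern captures, the sketch below applies the same regular expression in Python. The pattern itself is taken from the flag description; what it is matched against (for example a chromat or read name) and what the two groups stand for are not stated here, so the example name "D12345678ab01" and the helper split_name are purely illustrative.

```python
import re

# Default pattern quoted in the -sample flag description.
DEFAULT_SAMPLE_PATTERN = r"(\w{9})(\w{4})"


def split_name(name, pattern=DEFAULT_SAMPLE_PATTERN):
    """Return the captured groups, or None if the name does not match."""
    # re.match anchors at the start of the string only, so any text after
    # the first 13 word characters is ignored.
    m = re.match(pattern, name)
    return m.groups() if m else None


if __name__ == "__main__":
    print(split_name("D12345678ab01"))  # ('D12345678', 'ab01')
    print(split_name("short"))          # None
```

A name shorter than 13 word characters does not match at all, which is a quick way to check whether a data set needs a custom -sample pattern.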

 

            stdQuery.pl queries out the following:

                  Contig_consensus_seq   ( lab_symbol.contigs )

                  SNP list               ( lab_symbol.snp.list )

                  Diallelic indel list   ( lab_symbol.diallelic.list )

                  Genotypes for real sites  ( lab_symbol.database.txt )

 

            (3) qConsensusSeq.pl

                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )

           

            qConsensusSeq.pl queries out only the contig consensus sequences.

            The sequences are automatically output to lab_symbol.contigs.

 

            (4) qGenotype.pl

                  -polyphred     ( betaPolyphred output file )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )

                  <-sample>      ( optional for pga. Default '(\w{9})(\w{4})' )

           

            qGenotype.pl queries out only the genotypes for the real

            SNPs and diallelic indels stored in the database.

            The results are output to lab_symbol.database.txt.

 

            To get a prettybase file and some statistical analysis such as

            alleles.out, alleles.txt, freq.txt, hz.txt, hw.txt, stats.txt,

            and to update the consensus sequences with allele frequency to get

            refSeq.fasta and hugo.fsa, you can run the following programs.

 

            (5) makePrettybaseFile.pl

                  -db            ( lab_symbol.database.txt )

                  -population    ( /c16/pga.samples )

                  > lab_symbol.prettybase.txt  ( output file )

 

            (6) getFreqHzHw.pl

                  -pb            ( lab_symbol.prettybase.txt )

                  -indel         ( lab_symbol.diallelic.list )

                  The results are automatically output to

                        ( lab_symbol.alleles.out )

                        ( lab_symbol.alleles.txt )

                        ( lab_symbol.freq.txt )

                        ( lab_symbol.hz.txt )

                        ( lab_symbol.hw.txt )
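The output names suggest allele frequencies (freq), heterozygosity (hz) and Hardy-Weinberg (hw) figures. As a rough illustration of those quantities for a single site, here is a minimal sketch. It is not the code of getFreqHzHw.pl; the input layout (a list of allele pairs), the use of "N" for a missing call, and the function name site_summary are all assumptions.

```python
from collections import Counter


def site_summary(genotypes, missing="N"):
    """Allele frequencies and heterozygosity for one site.

    genotypes: list of (allele1, allele2) pairs, one pair per individual.
    Pairs that contain the missing symbol are skipped.
    """
    called = [(a, b) for a, b in genotypes if missing not in (a, b)]
    n = len(called)
    if n == 0:
        return None
    # Every called individual contributes two alleles.
    counts = Counter(allele for pair in called for allele in pair)
    freqs = {allele: c / (2 * n) for allele, c in counts.items()}
    # Observed heterozygosity: fraction of individuals with two alleles.
    observed = sum(1 for a, b in called if a != b) / n
    # Heterozygosity expected under Hardy-Weinberg: 1 - sum of p squared.
    expected = 1.0 - sum(p * p for p in freqs.values())
    return {"n": n, "freq": freqs, "obs_het": observed, "exp_het": expected}


if __name__ == "__main__":
    example = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G"), ("N", "N")]
    print(site_summary(example))
```

Comparing the observed and expected heterozygosity is the usual starting point for a Hardy-Weinberg check; the actual test reported in lab_symbol.hw.txt may be computed differently.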

 

            (7) site_stats.pl

                  -pb           ( lab_symbol.prettybase.txt )

                  -out          ( lab_symbol.stats.txt )

            It calculates the percentage of genotypes obtained for the

            whole gene, and the completeness by site and by individual.
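Completeness by site and by individual can be illustrated with the sketch below. It is not the code of site_stats.pl. It assumes each prettybase line holds four whitespace-separated fields (site, sample, allele 1, allele 2) and that a missing call is written as "N"; both the layout and the function name completeness are assumptions.

```python
from collections import defaultdict


def completeness(lines, missing="N"):
    """Percentage of called genotypes overall, per site and per sample."""
    by_site = defaultdict(lambda: [0, 0])    # site -> [called, total]
    by_sample = defaultdict(lambda: [0, 0])  # sample -> [called, total]
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        site, sample, a1, a2 = fields[:4]
        called = int(a1 != missing and a2 != missing)
        for tally in (by_site[site], by_sample[sample]):
            tally[0] += called
            tally[1] += 1

    def pct(tally):
        return 100.0 * tally[0] / tally[1]

    total_called = sum(t[0] for t in by_site.values())
    total = sum(t[1] for t in by_site.values())
    return {
        "overall": 100.0 * total_called / total if total else 0.0,
        "by_site": {s: pct(t) for s, t in by_site.items()},
        "by_sample": {s: pct(t) for s, t in by_sample.items()},
    }


if __name__ == "__main__":
    demo = ["100 S1 A A", "100 S2 N N", "200 S1 A G", "200 S2 G G"]
    print(completeness(demo))
```

In the demo data, site 100 is called for one of two samples and sample S2 is called at one of two sites, so both come out at 50 percent while the overall figure is 75 percent.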

 

            (8) getFinalRef.pl

                  -navigator     ( lab_symbol.snp.list )

                  -contig        ( lab_symbol.contigs )

                  -allele        ( lab_symbol.alleles.txt )

                  -hugo          ( hugo symbol )

                  The results are automatically output to

                        lab_symbol.refSeq.fasta

                        hugo.fsa

 

      If you do not need the details of this step-by-step protocol, you can

      run the following wrapper program instead. It populates all the analysis

      tables and queries out the standard “Nick PGA” output. (It requires that

      you follow the “Nick PGA” standard naming conventions and analysis.)

            (9) ga.pl

                  -polyphred   ( betaPolyphred output file )

                  -database    ( database name )

                  -lab_symbol  ( 5-letter code for genes )

                  -population  ( /c16/pga.samples for pga )

                  -exp_status  ( 3 choices: initial, clean-up, final )

                  -hugo        ( hugo symbol )

                  <-sample>    ( optional for pga. Default '(\w{9})(\w{4})' )
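For orientation, the sketch below prints the step-by-step commands from (1), (2) and (5) through (8) that correspond to one ga.pl run, with the intermediate file names derived from lab_symbol as described above. It is a dry run only. Whether ga.pl calls these programs in exactly this order is an assumption, and the helper names and example values are hypothetical.

```python
import shlex


def q(*args):
    """Join arguments into one shell-safe command string."""
    return " ".join(shlex.quote(a) for a in args)


def pipeline_commands(polyphred, database, lab_symbol, population,
                      exp_status, hugo):
    """Return the manual equivalent of one wrapper run as command strings."""
    if exp_status not in ("initial", "clean-up", "final"):
        raise ValueError("-exp_status must be initial, clean-up or final")
    s = lab_symbol
    pb = f"{s}.prettybase.txt"
    return [
        q("pAnalysisTables.pl", "-polyphred", polyphred, "-lab_symbol", s,
          "-exp_status", exp_status, "-database", database),
        q("stdQuery.pl", "-polyphred", polyphred, "-lab_symbol", s,
          "-database", database),
        # makePrettybaseFile.pl writes to standard output, hence the redirect.
        q("makePrettybaseFile.pl", "-db", f"{s}.database.txt",
          "-population", population) + " > " + shlex.quote(pb),
        q("getFreqHzHw.pl", "-pb", pb, "-indel", f"{s}.diallelic.list"),
        q("site_stats.pl", "-pb", pb, "-out", f"{s}.stats.txt"),
        q("getFinalRef.pl", "-navigator", f"{s}.snp.list",
          "-contig", f"{s}.contigs", "-allele", f"{s}.alleles.txt",
          "-hugo", hugo),
    ]


if __name__ == "__main__":
    # Placeholder values; only /c16/pga.samples comes from the text above.
    for line in pipeline_commands("polyphred.out", "DBNAME", "abcde",
                                  "/c16/pga.samples", "initial", "HUGO"):
        print(line)
```

The list makes the file dependencies explicit: each later step reads a file that an earlier step produced, which is why the naming conventions have to be followed when the wrapper is used.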

 

 

      To view your prettybase file you can run vg_oracle,

            (10) vg_oracle

                  This is modified from the current vg program. It still has

                  the same flags. The two flags you need to enter differently

                  are,

                  -database      ( database name )

                  -table         ( lab_symbol )

 

      (E) Final stage

 

            After GenBank releases our submission, the following two tables need

            to be populated,

                  FINAL_REF_SEQ, FEATURE_SITE

            The table GENOMIC_SEG also needs to be updated with the GenBank accession number.

 

                  populateFeatureAndFinalRef.pl

                  -genbank_id    ( genbank accession number )

                  -lab_symbol    ( 5-letter code for our genes )

                  -database      ( database name )