EGP Final Gene Cleanup Checklist



Overview:

This protocol is the detailed procedure for doing the final cleanup and verification on polymorphism data for a gene. This should be done once all cleanup data for a gene has been ordered and a gene is in the finishing stage. The criteria for a finished gene are:


    1. All PCR products have been sequenced or attempted multiple times
    2. Greater than 90% of all genotypes for the gene are completed (check the “Genotyped” number under “Summary for database” at the bottom of <gene>.stats.txt) AND
    3. No single SNP site is missing more than 20% of the data.  (For each SNP site, at least 80% of the samples should be genotyped.)
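Criteria 2 and 3 above can be checked mechanically once the genotype calls are available. The following is only a minimal Python sketch, not part of the pipeline: the function name `gene_is_finished` and the in-memory dictionary layout are assumptions made for illustration, and criterion 1 (PCR products sequenced or attempted) is not covered.

```python
def gene_is_finished(genotypes, min_overall=0.90, min_per_site=0.80):
    """Check the genotype-completeness criteria for a gene.

    genotypes maps a SNP site (any hashable label) to a list of calls,
    one per sample; None marks a missing genotype (assumed layout).
    """
    total = sum(len(calls) for calls in genotypes.values())
    if total == 0:
        return False
    called = sum(
        1 for calls in genotypes.values() for c in calls if c is not None
    )
    # Criterion 2: greater than 90% of all genotypes completed.
    overall_ok = called / total > min_overall
    # Criterion 3: every site has at least 80% of samples genotyped.
    per_site_ok = all(
        sum(c is not None for c in calls) / len(calls) >= min_per_site
        for calls in genotypes.values()
        if calls
    )
    return overall_ok and per_site_ok
```

A gene can pass the overall threshold and still fail because of a single poorly covered site, which is why the two checks are kept separate.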

Information on SNP site statistics can be found in the file <gene>.stats.txt after running pAnalysisTables.pl and ga.pl.

Following this verification protocol, the EGP Final Data Formatting protocol should be followed.

Required Files:

Latest assembly file:
<gene>.fasta.screen.ace.<iter>

Required Programs:

 

updateChromatStatus.pl
vg2
phredPhrap
polyPhred (version 4.0)
consed
pAnalysisTables.pl
ga.pl

Cleanup Checklist:

Abbreviated Procedure:


   1.  Reassemble the gene with all of the chromatograms.

    ·        Reassemble by running phredPhrap
    ·        Perform all contig joins into the largest contig in the assembly (usually with reference sequence).
    ·        Generally joins are performed by grouping all small contigs together before placing these in the largest contig containing the reference sequence.

    ·        Verify that there is no region with a major assembly problem (many mismatches)


   2.  Confirm/edit the consensus sequence.

    ·        Compare consensus sequence against reference sequence
    ·        Verify all mismatches/discrepancies

    ·        Review beginning and end of consensus sequence and edit to match primer sequences

    ·        Identify regions where the assembly is complex (highly repetitive), poor quality or no data exists.  Record these regions as "Region not scanned for variation".

    ·        Save the assembly


   3.  Run polyphred on the final assembly.

     polyphred -ace <latest ace file> -quality 25 > <gene>.polyphred.<iter>.out

    ·        Polyphred can be run at a lower quality to increase your genotype coverage (usually not lower than 20)

    ·        If you reduce the quality, identify low quality genotypes which have been introduced.  This can be done using compare_genotype.pl


   4.  Scan quickly through the final assembly for missed polymorphic sites.

     ·        Navigate to all "polyphredRank1" sites and check those not confirmed.

     ·        Navigate to all "checkGenotype" sites and confirm.

     ·        Look for polymorphism tags in the assembly where polyphred has not marked the column - these could have been missed.

     ·        Verify diallelicIndel and manualPolysite sites and genotype.

          NOTE:  All of these types of polymorphisms are referenced with
respect to the positive (forward orientation) strand.

     ·        Save the assembly and rerun polyphred.


   5.  Perform a final visual check and check genotype statistics.

     ·        Run pga_gene_analysis on the final polyphred output file.
     ·        Verify data using vg with the consed option
     ·        Resolve any conflicting data.

     ·        Check the genotype of all singleton and doubleton sites. These must be confirmed or very high quality.

     ·        Check all genotypes of sites with significant Hardy-Weinberg scores.

        Confirm genotypes on all sites where the population specific HW value is greater than 3.8.  Abnormally high HW values can be generated if a site is rare in a specific population (<5% allele frequency).

     ·        Check for differences in the LDE patterns between correlated sites.

     ·        NOTE:  It helps to run vg and view only common sites (-rare 10)
     ·        If any changes are made, polyphred and pga_gene_analysis must be rerun.

        For the final assembly the gene should be named with just the <gene> name.




Detailed Procedure:

  

1.         Update the status for any singlets, bad chromats or allele specific chromats.

           You will update the status of many of your chromats as you check in each day’s data, especially for any allele specific reads.  Run updateChromatStatus.pl to update the status of any chromats that still need to be updated.

         

            updateChromatStatus.pl  -chromat_list <gene>.date.status.txt  -status  -database

            where     -chromat_list is the list of the chromats to be updated.

                            -status is either “USED”, “BAD”, “SUSPECT”, “ALLELE_SPECIFIC” or “MISNAMED”

 

Bad chromats might include any chromats you pulled out of the assembly.  Also, the singlets should be changed to “BAD”.  The list of singlets can be found in the edit_dir in the file <gene>.fasta.screen.singlets.

“SUSPECT” reads might include reads that look good, but aren’t aligning in the correct position.

 

           For any allele specific reads, double check that the corresponding forward or reverse reads are removed for each allele specific chromat.  Also make sure to remove data for any reads where it’s unknown if the read is allele specific or not (due to missing data or bad data).
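As a rough illustration of collecting the singlet read names for the status update, the Python sketch below pulls names out of FASTA-style text. It assumes the singlets file is FASTA-formatted with the read name as the first token on each header line; the helper name `read_names_from_fasta` is hypothetical, and the list format that updateChromatStatus.pl actually expects is not described in this protocol, so check that before using any output.

```python
def read_names_from_fasta(fasta_text):
    """Return the first token of every '>' header line, in file order."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' with no name
                names.append(fields[0])
    return names


# Example with made-up read names.
example = ">readA.scf CHROMAT_FILE: readA.scf\nACGT\n>readB.scf\nTTGA\n"
singlet_names = read_names_from_fasta(example)
```

The resulting names could then be written to a list file for the “BAD” status update, one entry per read, if that matches the expected input.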

  

  
2.      Reassemble the gene with all of the chromatograms.

·         Double check to make sure that the reference sequence has been trimmed to include only end primers which have been used.  Also trim based on sequence data (details below).

·         Reassemble by running phredPhrap

·         Perform all contig joins to the largest contig in the assembly (usually with reference sequence).

·         Generally joins are performed by grouping all small contigs together before placing these in the largest contig containing the reference sequence.  You should end up with only one large contig that contains all “USED” reads.

·         Verify that there is no region with a major assembly problem (many mismatches)

·         Run 2fof.pl to check that all the used chromats are in the assembly.

 

 



If the consensus sequence is longer than the reference sequence, NN out the reads that are overhanging the reference sequence.  PhredPhrap will not take these into account when it creates the consensus sequence.  If the reference sequence is longer than the consensus sequence, edit the <gene>.reference.fasta file, editing off the bases before the first primer or after the last primer.  Remove the first phd file from the phd_dir and create a new phd file using mktrace. 

 

After you have received all of the cleanup data for a gene and you are near final assembly, the data should be reassembled to generate a final contig. This will give you a new assembly and will ensure that the consensus sequence and quality values are updated from all the available data.  Perform all contig joins as necessary to give the best single consensus contig possible. Verify that there is no region with a major assembly problem. This is generally signified by many high quality mismatches (i.e. red bases in consed).  2fof.pl can be run one more time to verify that phredPhrap did not take out chromats that should be included in the final assembly.
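The idea behind checking that every used chromat made it into the assembly can be sketched in Python as a set difference. This is not a reimplementation of 2fof.pl, whose internals are not described here. It assumes the ace file lists each assembled read on an `AF <read name> <C|U> <start>` line and that the names in your “USED” list match the read names in the ace file exactly; the function names are hypothetical.

```python
def reads_in_ace(ace_text):
    """Collect read names from the AF lines of ace-format text."""
    reads = set()
    for line in ace_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "AF":
            reads.add(parts[1])
    return reads


def missing_from_assembly(used_chromats, ace_text):
    """Return the USED chromats that do not appear in the assembly."""
    return sorted(set(used_chromats) - reads_in_ace(ace_text))


# Example with made-up read names.
ace_example = "CO Contig1 10 2 1 U\nAF readA U 1\nAF readB C 5\n"
dropped = missing_from_assembly(["readA", "readC"], ace_example)
```

Any name reported as missing is a read to investigate before accepting the assembly as final.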


  
3.      Confirm/edit the consensus sequence.

·         Compare consensus sequence against reference sequence

·         Verify all mismatches/discrepancies

·         Review beginning and end of consensus sequence and edit to match primer sequences

·         Identify regions where the assembly is complex (highly repetitive), poor quality or no data exists.  Record these regions in your notes as "Region not scanned for variation" (to be used in the GenBank submission).

·         Save the assembly





Scan through the entire gene and compare the consensus sequence against the reference sequence in the assembly.  Mismatches will usually be signified by a red base in the reference sequence.  At confirmed SNP sites either allele may be present.  There should not be major base discrepancies, though single bases may disagree.  In these cases, view the chromatogram data and verify that the consensus sequence is correct.  If it is not clear from the reads what the correct consensus sequence should be (due to low quality data or low complexity data), as a default use the (GenBank curated) reference sequence in the assembly.

If the consensus sequence is correct but the reference sequence is wrong no changes are necessary.
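The comparison described above is normally done by eye in consed, but the bookkeeping can be sketched in Python. This sketch assumes the consensus and reference have already been exported as two aligned strings of equal length (padding included); the function name `list_mismatches` and that input layout are assumptions for illustration only.

```python
def list_mismatches(consensus, reference):
    """Return (position, consensus_base, reference_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    therefore have the same length.
    """
    if len(consensus) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = []
    pairs = zip(consensus.upper(), reference.upper())
    for pos, (c, r) in enumerate(pairs, start=1):
        if c != r:
            mismatches.append((pos, c, r))
    return mismatches


# Example with made-up sequence.
found = list_mismatches("ACGTA", "ACCTA")
```

Each reported position still needs to be checked against the chromatogram data, since a mismatch at a confirmed SNP site is expected.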

NOTE: If after reviewing the chromatogram data it is clear the consensus sequence should be changed, it can be overridden by bringing up a chromatogram sequence and using <M2> (mouse button 2) on the position to be changed. From the subsequent menu choose "Change Consensus". This will change the consensus sequence to the base at that position in the chromatogram.

It is especially important to review the consensus sequence at the very beginning and end of the assembly, at/near the first and last primer positions.  If the assembled reference sequence differs from the consensus sequence here (usually due to low quality data in this region), the consensus should be overridden with the sequence from the assembled reference to match the primers.


At this point you should also check for regions which may not have been completely scanned.  This may be due to poly(N) tracts truncating a read, a region which was difficult or impossible to amplify by PCR, or just low quality sequence.  If the region is fairly small (<100 bp) with decent coverage we consider it scanned.  If it is a complete gap without any data, record the coordinates of the consensus so this can be annotated later in the GenBank submission file as “Region not Scanned for Variation”.
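Recording the coordinates of complete gaps can be illustrated with a short Python sketch. It assumes you have a per-position read depth for the consensus as a simple list (how that list is obtained is outside this protocol), and the function name `unscanned_regions` is hypothetical. It only reports stretches with no data at all; judging whether a small low-quality region counts as scanned remains a manual decision.

```python
def unscanned_regions(coverage):
    """Return 1-based inclusive (start, end) intervals with zero coverage.

    coverage is a list of read depths, one entry per consensus position.
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage, start=1):
        if depth == 0 and start is None:
            start = pos  # a gap opens here
        elif depth != 0 and start is not None:
            regions.append((start, pos - 1))  # the gap closed at pos - 1
            start = None
    if start is not None:
        regions.append((start, len(coverage)))  # gap runs to the end
    return regions


# Example with made-up depths.
gaps = unscanned_regions([2, 0, 0, 3, 0])
```

The intervals returned are the coordinates to note for the “Region not Scanned for Variation” annotation.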


After reviewing the consensus sequence save the assembly in Consed.


  
4.      Run polyphred on the final assembly.

     polyphred -ace <latest ace file> -quality 25 >  <gene>.polyphred.<iter>.out

·         Polyphred can be run at a lower quality to increase your genotype coverage (usually not lower than 20).

·         If you reduce the quality, identify low quality genotypes which have been introduced.  This can be done using compare_genotype.pl

·         You can also apply tags to the gold “dataNeeded” regions of the gene, but only apply these tags conservatively.  This may allow you to increase coverage without lowering the polyphred quality.
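Identifying the genotypes introduced by a lower quality setting amounts to comparing two sets of calls. The Python sketch below shows that comparison only. It is not compare_genotype.pl, and it assumes the calls from each polyphred run have already been loaded into dictionaries keyed by (site, sample); the function name `compare_calls` and that layout are assumptions.

```python
def compare_calls(strict_calls, relaxed_calls):
    """Compare genotype calls from a strict run and a relaxed run.

    Each argument maps (site, sample) to a genotype string such as "AG".
    Returns (introduced, changed): calls present only in the relaxed run,
    and calls whose genotype differs between the two runs.
    """
    introduced = {
        key: gt for key, gt in relaxed_calls.items() if key not in strict_calls
    }
    changed = {
        key: (strict_calls[key], gt)
        for key, gt in relaxed_calls.items()
        if key in strict_calls and strict_calls[key] != gt
    }
    return introduced, changed


# Example with made-up calls at quality 25 and quality 20.
q25 = {(100, "s1"): "AG", (250, "s1"): "CC"}
q20 = {(100, "s1"): "AG", (100, "s2"): "AA", (250, "s1"): "CT"}
introduced, changed = compare_calls(q25, q20)
```

Both groups are worth a look in consed: the introduced calls because they rest on lower quality data, and the changed calls because the two runs disagree.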

 



  
5.      Scan quickly through the final assembly for missed polymorphic sites.

·         Navigate to all "polyphredRank1" sites and check those not confirmed.  Also check “polyphredRank 2” and “polyphredRank 3” sites.

·         Navigate to all "checkGenotype" sites and confirm.

·         Look for polymorphism tags in the assembly where polyphred has not marked the column - these could have been missed.

·         Verify diallelicIndel and manualPolysite sites and genotype.

          NOTE:  All of these types of polymorphisms are referenced with respect to the positive (forward orientation) strand.

·         Save the assembly and rerun polyphred.

 




    
A) Make sure that no obvious SNP sites have been missed, especially for unconfirmed rank 1 sites (red).  The “polyphredRank 2” and “polyphredRank 3” sites should also be checked.

B) Look for previous polymorphism tags left on a site but not marked by polyphred in the final assembly. This may mean that no polymorphism exists at that site or that polyphred could not pick up a polymorphism at this site anymore once new data had been added.

If polyphred is missing a real polymorphism which was previously marked you should apply a "manualPolySite" on the consensus sequence at this position and "manualGenotype" at each chromatogram position.  If polyphred is rerun, the individual genotypes should be called; double check that these sites were called correctly.  Alternatively, one could run polyphred at a lower rank setting to see if the site is called - this can save time by avoiding the need to apply the manualGenotype tag.



    
C) Verify that all "diallelicIndel" and "manualPolySite" tags are located at the correct positions and the "manualGenotype" tags are correct (ConsensusPosition).  This is important because the consensus coordinate may change following reassembly.

 
For "diallelicIndel" tags, make sure to update the consensus sequence to reflect the longest form of the indel.  To mark an indel in the consensus sequence, swipe the length of the indel with the middle mouse key.  From the list that pops up, select “diallelic indel”.  For each individual site, mark the reads at the first base (if the indel is longer than one base).  If the indel needs to be annotated on top of a pad, change the pad to N before placing the annotation.

 

There are hotkeys for annotation of indels:

Control i = indel + +

Control d = indel - -

Control c = indel + -

     Both "diallelicIndel" and "manualPolySite" tags require explicit genotypes to be applied to the reads at these positions.



                     Diallelic Insertion/Del.     Unmarked Polymorphism
     Consensus Tag   diallelicIndel               ManualPolySite
     Read Tag        indel                        ManualGenotype



     The general format for read tags is:  ConsensusPosition Allele1 Allele2

     e.g. 200 AG AG
          200 - AG
          200 - -
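The read tag format shown above (ConsensusPosition Allele1 Allele2) is simple enough to validate with a few lines of Python. This is a minimal sketch for checking tag text before or after entering it. The names `ReadTag` and `parse_read_tag` are hypothetical and are not part of consed or polyphred.

```python
from typing import NamedTuple


class ReadTag(NamedTuple):
    position: int  # consensus position
    allele1: str   # first allele, "-" for the deleted form
    allele2: str   # second allele, "-" for the deleted form


def parse_read_tag(text):
    """Parse 'ConsensusPosition Allele1 Allele2' into a ReadTag."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError("expected three fields: position allele1 allele2")
    return ReadTag(int(parts[0]), parts[1], parts[2])


# The three examples from the protocol.
tags = [parse_read_tag(t) for t in ("200 AG AG", "200 - AG", "200 - -")]
```

Parsing the tags this way makes it easy to confirm that every tag at a site carries the same consensus position, which matters because coordinates can shift after reassembly.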

     NOTE:  diallelicIndels are also referenced with respect to the positive strand (forward orientation).  This is important if there is only reverse data for an indel site AND the indel is longer than one base pair.

     NOTE:  We are not calling multiallelic indels.

     NOTE:  We are not calling indels after poly-N tracts.  (Poly-N tracts are repeats of any one base more than 7 times).

     NOTE:  Make sure the consensus sequence reflects the longest version of the indel. 

     NOTE:  Always mark the indel with respect to the consensus sequence, not the read.  In some cases you won’t be able to mark the read exactly due to pads.  In this case overstrike the pad with a base and apply the appropriate indel genotype.

     NOTE:  In cases where the indel is in a repeated unit, mark where you first observe the indel, not the first instance of the repeated unit.
    

D) Save the assembly to save any new changes and rerun polyphred

 




  
6.      Perform a final visual check and check genotype statistics.

·         Run pAnalysisTable.pl and ga.pl on the final polyphred output file.  Double check the <gene>.stats.txt file.

·         Verify data using vg2 with the consed option

·         Resolve any conflicting data.

·         Check the genotype of all singleton and doubleton sites.  These must be confirmed or very high quality.

·         Check all genotypes of sites with significant Hardy-Weinberg scores.

          Confirm genotypes on all sites where the population specific HW value is greater than 3.8.  Abnormally high HW values can be generated if a site is rare in a specific population (<5% allele frequency).

·         Check for differences in the LDE patterns between correlated sites.

·         NOTE:  It helps to run vg2 clustered for LDE and view only common sites (-rare 10)

·         If any changes are made, polyphred, pAnalysisTable.pl and ga.pl must be rerun.

          For the final assembly the gene should be named with just the <gene> name, and the “final” option for the status switch should be used.

NOTE 01/31/02:  Currently, you can only run ga.pl with the “final” option once.  If you need to make changes and run ga.pl again, you need to ask someone to change the status in the database from “final” to “clean-up”.

Qian is currently writing a program that the analysts will be able to use to change the status.
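To make the singleton/doubleton and rare-site checks in the list above concrete, here is a small Python sketch that summarizes the minor allele at one site. It assumes a diallelic site, genotypes held as pairs of alleles with None for missing calls, and that “singleton” and “doubleton” mean the minor allele was observed once or twice among all called chromosomes; the function name `minor_allele_summary` is hypothetical and this is not how the analysis scripts compute their tables.

```python
from collections import Counter


def minor_allele_summary(genotypes):
    """Summarize the minor allele at one diallelic site.

    genotypes is a list of (allele1, allele2) tuples; None marks a
    missing call and is ignored.
    """
    counts = Counter(a for g in genotypes if g is not None for a in g)
    total = sum(counts.values())
    if len(counts) < 2:
        return {"minor_count": 0, "minor_freq": 0.0, "label": "monomorphic"}
    # For a diallelic site the second most common allele is the minor one.
    minor_count = counts.most_common()[1][1]
    if minor_count == 1:
        label = "singleton"
    elif minor_count == 2:
        label = "doubleton"
    else:
        label = "other"
    return {
        "minor_count": minor_count,
        "minor_freq": minor_count / total,
        "label": label,
    }


# Example: nine AA samples and one AG sample.
summary = minor_allele_summary([("A", "A")] * 9 + [("A", "G")])
```

Sites labelled singleton or doubleton are the ones that must be confirmed or very high quality, and `minor_freq` can be compared against the 5% and 10% thresholds mentioned in this step.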

 




    
A) Run pAnalysisTable.pl and ga.pl on the final polyphred output file.  Double check the <gene>.stats.txt file, as the coverages may have changed following step 5.

B) The data can be verified visually using vg2 with the -consed option.

     ** Check the genotype of singletons and doubletons
     ** Check all sites with Hardy-Weinberg (HW) scores above 3.8 (from the last column in the <gene>.hw.txt file).  This might be an indication of allele-specific amplification.
     ** Check for differences in the LDE patterns between correlated sites.

         - It is also useful to run vg2 clustered for LDE and to view only common sites when looking at LDE correlations. For the PGA, if you wanted to look at sites with > 10% allele frequency, use the command:

     vg -file <gene>.prettybase.txt -rare 10


 
C) If anything was modified you must repeat the polyphred analysis and rerun pAnalysisTable.pl and ga.pl.
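For readers unfamiliar with the Hardy-Weinberg score used in this step, the Python sketch below computes the usual one-degree-of-freedom chi-square statistic from genotype counts at a diallelic site. That the pipeline's HW value is this statistic is an assumption (the 3.8 cutoff is close to the 3.84 critical value at p = 0.05); the exact calculation behind <gene>.hw.txt is not described in this protocol, and the function name `hw_chi_square` is hypothetical.

```python
def hw_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb are the observed counts of the two homozygotes and
    the heterozygote at a diallelic site.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts in perfect HW proportions, and counts with no heterozygotes.
balanced = hw_chi_square(25, 50, 25)
no_hets = hw_chi_square(50, 0, 50)
```

With a rare allele the expected count of the rare homozygote is very small, so a single observed homozygote can dominate the sum. That is consistent with the caution above about sites under 5% allele frequency.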
 



Revision History:
mjr 22-April-2001 Wrote initial protocol
mjr 11-May-2001 Edited protocol and added major summary points.
mjr 26-May-2001 Edited protocol to make clearer.
mjr 20-June-2001  Made revisions suggested by CP

clp 28-Jan-2002 Edited protocol for EGP.