PGA Final Gene Cleanup Checklist

 

Overview:

This protocol is the detailed procedure for the final cleanup and verification of polymorphism data for a gene. It should be done once all cleanup data for a gene has been ordered and the gene is in the finishing stage. The criteria for a finished gene are:

 

1. All PCR products have been sequenced or attempted multiple times,

2. Greater than 90% of all genotypes for the gene are completed, AND

3. No single SNP site is missing more than 80% of the data.

 

Information on SNP site statistics can be found in the file <gene>.stats.txt after running pga_gene_analysis.
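
The two numeric criteria above can be checked mechanically once genotype calls are in hand. The following is a minimal sketch only: the function name, its arguments, and the in-memory layout (a dict of site to list of calls, with None for a missing genotype) are assumptions made for illustration. It does not read <gene>.stats.txt, whose format is not described in this protocol.

```python
def gene_is_finished(genotypes, all_products_attempted,
                     min_overall=0.90, max_site_missing=0.80):
    """Apply the three finishing criteria to in-memory genotype calls.

    genotypes: hypothetical layout, dict of site -> list of calls,
               where None marks a missing genotype.
    all_products_attempted: True if criterion 1 has been met.
    """
    total = sum(len(calls) for calls in genotypes.values())
    if total == 0:
        return False  # no data at all cannot count as finished

    completed = sum(1 for calls in genotypes.values()
                    for call in calls if call is not None)
    # Criterion 2: more than 90% of all genotypes completed.
    overall_ok = completed / total > min_overall

    # Criterion 3: no single site missing more than 80% of its data.
    sites_ok = all(calls.count(None) / len(calls) <= max_site_missing
                   for calls in genotypes.values() if calls)

    return bool(all_products_attempted and overall_ok and sites_ok)
```

Keeping the thresholds as keyword arguments makes it easy to see how close a gene is to passing when a cutoff is tightened or relaxed.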

 

Following this verification protocol, the PGA Final Data Formatting protocol should be followed.

 

Required Files:

 

Latest assembly file: <gene>.fasta.screen.ace.<iter>

 

Required Programs:

 

vg

phredPhrap

polyPhred

consed

pga_gene_analysis

 

 

Abbreviated Procedure:

 

1.  Reassemble the gene with all of the chromatograms.

 

·        Reassemble by running phredPhrap

·        Perform all contig joins into the largest contig in the assembly (usually the one with the reference sequence).

·        Generally joins are performed by grouping all small contigs together before placing these in the largest contig containing the reference sequence.

·        Verify that there is no region with a major assembly problem (many mismatches)       

 

2.  Confirm/edit the consensus sequence.

 

·        Compare consensus sequence against reference sequence

·        Verify all mismatches/discrepancies

·        Review beginning and end of consensus sequence and edit to match primer sequences

·        Identify regions where the assembly is complex (highly repetitive), poor quality or no data exists.  Record these regions as "Region not scanned for variation".

·        Save the assembly

 

3.  Run polyphred on the final assembly.

 

polyphred -ace <latest ace file> -quality 25 > <gene>.polyphred.<iter>.out

 

·        Polyphred can be run at a lower quality to increase your genotype coverage (usually not lower than 20)

·        If you reduce the quality, identify the low quality genotypes which have been introduced.  This can be done using compare_genotype.pl

 

4.  Scan quickly through the final assembly for missed polymorphic sites.

 

·        Navigate to all "polyphredRank1" sites and check those not confirmed.

·        Navigate to all "checkGenotype" sites and confirm.

·        Look for polymorphism tags in the assembly where polyphred has not marked the column  - these could have been missed.

·        Verify diallelicIndel and manualPolySite sites and genotypes.

NOTE:  All of these types of polymorphisms are referenced with respect to the positive (forward orientation) strand.

·        Save the assembly and rerun polyphred.

 

5.  Perform a final visual check and check genotype statistics.

 

·        Run pga_gene_analysis on the final polyphred output file.

·        Verify data using vg with the -consed option

·        Resolve any conflicting data.

·        Check the genotype of all singleton and doubleton sites.  These must be confirmed or very high quality.

·        Check all genotypes of sites with significant Hardy-Weinberg scores.

Confirm genotypes on all sites where the population-specific HW value is greater than 3.8.  Abnormally high HW values can be generated if a site is rare in a specific population (<5% allele frequency).

·        Check for differences in the LDE patterns between correlated sites.

·        NOTE:  It helps to run vg and view only common sites (-rare 10)

·        If any changes are made, polyphred and pga_gene_analysis must be rerun.

For the final assembly, the gene should be named with just the <gene> name.

 

 

 

Detailed Procedure:

 

1.      Reassemble the gene with all of the chromatograms.

 

·        Reassemble by running phredPhrap

·        Perform all contig joins into the largest contig in the assembly (usually the one with the reference sequence).

·        Generally joins are performed by grouping all small contigs together before placing these in the largest contig containing the reference sequence.

·        Verify that there is no region with a major assembly problem (many mismatches)       

 

After you have received all of the cleanup data for a gene and you are near final assembly, the data should be reassembled to generate a final contig.  The new assembly ensures that the consensus sequence and quality values are updated from all the available data.  Perform all contig joins as necessary to give the best single consensus contig possible.  Verify that there is no region with a major assembly problem.  This is generally signified by many high quality mismatches (i.e. red bases in consed).

 

 

2.      Confirm/edit the consensus sequence.

 

·        Compare consensus sequence against reference sequence

·        Verify all mismatches/discrepancies

·        Review beginning and end of consensus sequence and edit to match primer sequences

·        Identify regions where the assembly is complex (highly repetitive), poor quality or no data exists.  Record these regions as "Region not scanned for variation".

·        Save the assembly

 

 

Scan through the entire gene and compare the consensus sequence against the reference sequence in the assembly.  At confirmed SNP sites either allele may be present.  There should not be major base discrepancies, though single bases may disagree.  In these cases, view the chromatogram data and verify that the consensus sequence is correct.
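
As an aid to this comparison, the discrepancies can be listed programmatically before they are reviewed in consed. The sketch below is illustrative only: the function name and arguments are hypothetical, and it assumes the consensus and reference have already been exported as aligned strings of equal length, which this protocol does not itself describe. Positions of confirmed SNP sites can be passed in so they are not reported.

```python
def list_discrepancies(consensus, reference, snp_positions=()):
    """Return (position, consensus_base, reference_base) for each mismatch.

    Positions are 1-based. Both sequences are assumed to be aligned
    already and of equal length (an assumption of this sketch).
    snp_positions: positions of confirmed SNP sites to skip, since
    either allele may legitimately appear there.
    """
    if len(consensus) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    skip = set(snp_positions)
    pairs = zip(consensus.upper(), reference.upper())
    return [(pos, c, r)
            for pos, (c, r) in enumerate(pairs, start=1)
            if c != r and pos not in skip]
```

Each reported position still needs a look at the chromatogram data; the list only tells you where to look.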

 

NOTE: If after reviewing the chromatogram data it is clear the consensus sequence should be changed, it can be overridden by bringing up a chromatogram sequence and using <M2> (mouse button 2) on the position to be changed. From the subsequent menu choose "Change Consensus". This will change the consensus sequence to the base at that position in the chromatogram.

 

It is especially important to review the consensus sequence at the very beginning and end of the assembly, at or near the first and last primer positions.  If the consensus differs from the reference sequence here (usually due to low quality data in this region), it should be overridden with the sequence from the assembled reference to match the primers.

 

At this point you should also check for regions which may not have been completely scanned.  This may be due to poly(N) tracts truncating a read, a region which was difficult or impossible to amplify by PCR, or simply low quality sequence.  If the region is fairly small (<100 bp) with decent coverage, we consider it scanned.  If it is a complete gap without any data, record the coordinates in the consensus so this can be annotated later in the GenBank submission file.
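
Finding complete gaps amounts to locating runs of consensus positions with no read coverage. The following sketch assumes a per-position read depth list is available (index 0 corresponding to consensus position 1); how that list is obtained from the assembly is outside this protocol, and the function name is hypothetical.

```python
def unscanned_regions(coverage):
    """Return (start, end) consensus coordinates of zero-coverage runs.

    coverage: list of read depth per consensus position, where index 0
    is consensus position 1 (an assumed input format).
    Coordinates are 1-based and inclusive.
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage, start=1):
        if depth == 0 and start is None:
            start = pos                       # a gap opens here
        elif depth != 0 and start is not None:
            regions.append((start, pos - 1))  # the gap closed at the previous base
            start = None
    if start is not None:                     # gap runs to the end of the consensus
        regions.append((start, len(coverage)))
    return regions
```

The coordinate pairs returned are what would be recorded as "Region not scanned for variation". Low coverage regions that still contain some data need the manual judgment described above.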

 

After reviewing the consensus sequence save the assembly in Consed.

 

3.      Run polyphred on the final assembly.

 

polyphred -ace <latest ace file> -quality 25 > <gene>.polyphred.<iter>.out

 

·        Polyphred can be run at a lower quality to increase your genotype coverage (usually not lower than 20)

·        If you reduce the quality, identify the low quality genotypes which have been introduced.  This can be done using compare_genotype.pl
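
The idea behind comparing two runs is to isolate the calls that appear only at the relaxed quality setting, since those are the ones to inspect. The sketch below illustrates that comparison on hypothetical in-memory data (dicts keyed by (site, sample)); it is not the implementation of compare_genotype.pl, whose inputs and behavior are not described here.

```python
def introduced_genotypes(strict_calls, relaxed_calls):
    """Return calls present in the relaxed run but absent from the strict run.

    Both arguments are hypothetical dicts of (site, sample) -> genotype,
    e.g. built from polyphred output at -quality 25 and -quality 20.
    """
    return {key: genotype
            for key, genotype in relaxed_calls.items()
            if key not in strict_calls}


def changed_genotypes(strict_calls, relaxed_calls):
    """Return calls made in both runs whose genotype differs: (strict, relaxed)."""
    return {key: (strict_calls[key], relaxed_calls[key])
            for key in strict_calls.keys() & relaxed_calls.keys()
            if strict_calls[key] != relaxed_calls[key]}
```

Calls returned by the first function are the new low quality genotypes; calls returned by the second are conflicts between the two runs and deserve the same scrutiny.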

 

4.      Scan quickly through the final assembly for missed polymorphic sites.

 

·        Navigate to all "polyphredRank1" sites and check those not confirmed.

·        Navigate to all "checkGenotype" sites and confirm.

·        Look for polymorphism tags in the assembly where polyphred has not marked the column  - these could have been missed.

·        Verify diallelicIndel and manualPolySite sites and genotypes.

NOTE:  All of these types of polymorphisms are referenced with respect to the positive (forward orientation) strand.

·        Save the assembly and rerun polyphred.

 

 

A) Make sure that no obvious SNP sites have been missed, especially for unconfirmed rank 1 sites (red).

 

B) Look for previous polymorphism tags left on a site but not marked by polyphred in the final assembly. This may mean that no polymorphism exists at that site or that polyphred could not pick up a polymorphism at this site anymore once new data had been added.

 

If polyphred is missing a real polymorphism which was previously marked you should apply a "manualPolySite" on the consensus sequence at this position and "manualGenotype" at each chromatogram position.  Alternatively, one could run polyphred at a lower rank setting to see if the site is called - this can save time by avoiding the need to apply the manualGenotype tag.

 

C) Verify that all "diallelicIndel" and "manualPolySite" tags are located at the correct positions and the "manualGenotype" tags are correct (ConsensusPosition).  This is important because the consensus coordinate may change following reassembly.

 

Both “diallelicIndel” and “manualPolySite” tags require explicit genotypes to be applied to the reads at these positions. 

 

 

 

                 Diallelic Insertion/Del.    Unmarked Polymorphism
Consensus Tag    diallelicIndel              ManualPolySite
Read Tag         indel                       ManualGenotype

The general format for read tags is:  ConsensusPosition Allele1 Allele2

 

e.g. 200 AG AG

200 - AG

200 - -

 

NOTE: diallelicIndels are also referenced with respect to the positive strand (forward orientation).
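
Because consensus coordinates can shift after reassembly, it can help to pull the position out of each read tag and compare it with the current position of the consensus tag. The sketch below parses the "ConsensusPosition Allele1 Allele2" format shown above. The names ReadTag, parse_read_tag, and tags_at_wrong_position are hypothetical, and treating "-" as the deletion allele is an inference from the examples rather than something stated in the protocol.

```python
from typing import NamedTuple


class ReadTag(NamedTuple):
    position: int   # consensus position the genotype refers to
    allele1: str    # "-" is assumed to denote the deletion allele
    allele2: str


def parse_read_tag(text):
    """Parse a read tag of the form 'ConsensusPosition Allele1 Allele2'."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected 'ConsensusPosition Allele1 Allele2': %r" % text)
    position, allele1, allele2 = fields
    return ReadTag(int(position), allele1, allele2)


def tags_at_wrong_position(tag_texts, consensus_position):
    """Return the parsed tags whose position differs from the consensus tag's."""
    parsed = [parse_read_tag(text) for text in tag_texts]
    return [tag for tag in parsed if tag.position != consensus_position]
```

For example, if the consensus tag now sits at position 203 after reassembly, any read tag still recorded at 200 would be returned for correction.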

 

D) Save the assembly to keep any new changes and rerun polyphred.

 

 

 

5.      Perform a final visual check and check genotype statistics.

 

·        Run pga_gene_analysis on the final polyphred output file.

·        Verify data using vg with the -consed option

·        Resolve any conflicting data.

·        Check the genotype of all singleton and doubleton sites.  These must be confirmed or very high quality.

·        Check all genotypes of sites with significant Hardy-Weinberg scores.

Confirm genotypes on all sites where the population-specific HW value is greater than 3.8.  Abnormally high HW values can be generated if a site is rare in a specific population (<5% allele frequency).

·        Check for differences in the LDE patterns between correlated sites.

·        NOTE:  It helps to run vg and view only common sites (-rare 10)

·        If any changes are made, polyphred and pga_gene_analysis must be rerun.

For the final assembly, the gene should be named with just the <gene> name.

 

 

A) Run pga_gene_analysis on the final polyphred output file.

 

B) The data can be verified visually using vg with the -consed option.

 

** Check the genotypes of singletons and doubletons

** Check all sites with Hardy-Weinberg (HW) scores above 3.8 (from the <gene>.hw.txt file).   This might be an indication of allele-specific amplification.

** Check for differences in the LDE patterns between correlated sites.

-It is also useful to run vg to view only common sites when looking at LDE correlations. For the PGA, to look at sites with > 10% allele frequency, use the command:

 

vg -file <gene>.prettybase.txt -rare 10

C) If anything was modified you must repeat the polyphred analysis and pga_gene_analysis.
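
To see where the 3.8 cutoff comes from, it helps to compute a Hardy-Weinberg statistic by hand. The sketch below assumes the HW value is the standard one degree of freedom chi-square comparing observed and expected genotype counts at a diallelic site; 3.8 is close to the 0.05 critical value (3.84) of that distribution. Whether pga_gene_analysis computes its HW value in exactly this way is not stated in this protocol, and the function name is hypothetical.

```python
def hw_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for Hardy-Weinberg equilibrium at a diallelic site.

    n_aa, n_ab, n_bb: observed genotype counts in one population.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Classes with zero expected count (monomorphic sites) are skipped.
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)


def needs_review(n_aa, n_ab, n_bb, threshold=3.8):
    """True if the site exceeds the protocol's HW cutoff."""
    return hw_chi_square(n_aa, n_ab, n_bb) > threshold
```

This also shows why rare sites give abnormally high values: a single rare homozygote among 50 samples, hw_chi_square(49, 0, 1), comes out at about 50, because the expected count for that genotype class is tiny.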

 

 

Revision History:

mjr 22-April-2001 Wrote initial protocol

mjr 11-May-2001 Edited protocol and added major summary points.

mjr 26-May-2001 Edited protocol to make clearer.

mjr 20-June-2001  Made revisions suggested by CP