PGA Final Data Formatting

Overview:

This protocol outlines the final checklist for formatting final data to be posted on the PGA website. A gene is generally considered finished when:

  1. All PCR products have been sequenced or attempted multiple times,

  2. Greater than 90% of the genotypes for the gene are completed, AND

  3. No single SNP site is missing more than 80% of the data.
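
The three criteria above can be written as a small check. This is only a sketch: the layout of 'gene'.stats.txt is not described in this protocol, so the function name and its inputs (plain numbers that you would read off the stats file by hand) are assumptions.

```python
def gene_is_finished(all_products_attempted, genotype_completion, site_missing_fractions):
    """Apply the three 'finished gene' criteria from the overview.

    all_products_attempted: True if every PCR product was sequenced or
        attempted multiple times (hypothetical input, judged by the user).
    genotype_completion: fraction of genotypes completed, 0.0 to 1.0.
    site_missing_fractions: one fraction per SNP site giving the share of
        missing data at that site.
    """
    if not all_products_attempted:
        return False
    # Criterion 2: strictly greater than 90% of genotypes completed.
    if genotype_completion <= 0.90:
        return False
    # Criterion 3: no single site may be missing more than 80% of its data.
    return all(frac <= 0.80 for frac in site_missing_fractions)
```

For example, gene_is_finished(True, 0.95, [0.10, 0.50]) passes, while a single site with 85% missing data fails the check.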

Information on SNP site statistics can be found in the 'gene'.stats.txt after running <pAnalysis_pgdb.pl> and <ga_pgdb.pl> in the </edit_dir>. This will also produce a standard set of working files starting with the gene name. To run these scripts you must first:

  1. Verify the cDNA sequence downloaded from LocusLink against our reference sequence:
    run <cross_match> with the -alignments option and check for deletions, mismatches, etc.

  2. Run <pAnalysis_pgdb.pl> and <ga_pgdb.pl> to produce the final output files

  3. Review the chimpanzee data using <refcomp> to mark all sites that vary from the human reference.

Required Files:

Output files from <ga.pl> in the ( /'gene'/edit_dir )

'gene'.*.txt
'gene'.refSeq.fasta
'gene'.alleles.out
'HUGO'.fsa

Other files (with standard locations):

'GENE'.primers.fasta in the ( /'gene'/fasta_dir )

NM_*.cdna.fasta or
XM_*.cdna.fasta (for the current gene - usually in the "<gene>/fasta_dir/")

NP_*.protein.fasta or
XP_*.protein.fasta (for the current gene - usually in the "<gene>/fasta_dir/")

It is a good idea to not use your original NM_*.cdna.fasta file for analysis. Instead, copy this file to 'gene'.cdna.fasta
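
Before starting, it can help to confirm that the files listed above are in their standard locations. The following is a minimal sketch, not one of the lab scripts: the function name is hypothetical, it assumes the edit_dir and fasta_dir layout described above, and it skips the 'gene'.*.txt wildcard and the primers file.

```python
from pathlib import Path


def missing_required_files(gene_dir, gene, hugo):
    """Return the required files (relative to gene_dir) that are not present.

    Assumes the standard layout from this protocol: output files in
    edit_dir and the working cDNA copy in fasta_dir.
    """
    gene_dir = Path(gene_dir)
    expected = [
        gene_dir / "edit_dir" / f"{gene}.refSeq.fasta",
        gene_dir / "edit_dir" / f"{gene}.alleles.out",
        gene_dir / "edit_dir" / f"{hugo}.fsa",
        gene_dir / "fasta_dir" / f"{gene}.cdna.fasta",
    ]
    # Report paths relative to the gene directory so the list is easy to read.
    return [p.relative_to(gene_dir).as_posix() for p in expected if not p.is_file()]
```

An empty list means every file in the sketch's list was found.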

Required Programs:
pAnalysisTables.pl
ga.pl
translate
refcomp


Abbreviated Procedure:


1. Verify your cDNA sequence.

All of the cDNA sequence confirmation should be done in the /'gene'/fasta_dir

NOTE: If the alignment of the cdna against your reference sequence generates poor alignments you will need to edit the 'gene'.exons.xm file and remove these entries.

translate.pl -cdna <gene>.cdna.fasta -pep <gene>.protein.fasta

Copy 'gene'.cdna.fasta and 'gene'.protein.fasta to the 'gene'/stats_dir.



2. Move the final data files to the stats_dir

All of these steps should be performed in the 'gene'/stats_dir

NOTE: If you had to adjust the coordinates of the protein translation when making your <gene>.protein.fasta file or if you have poor alignment entries in the 'gene'.exons.xm file you will need to rerun translate.pl.  If you had to adjust the coordinates of the protein translation start site use <-startpos> switch.  If you had bad alignment entries, they must be edited out of the 'gene'.exons.xm file.

3. Verify mapping of the cSNPs

4. Analysis of chimpanzee data

Copy the final human reference sequence to the chimp fasta_dir for comparison with chimp:

cp ../../stats_dir/'gene'.refSeq.fasta 'gene'/chimp_dir/fasta_dir/.


· Run <refcomp> and send results to:

'gene'.chimp.refcomp.#.out

· Mark fixed difference sites with the polymorphismConfirmed tag.

· After confirming the sites, save the list of polymorphismConfirmed tags to:

'gene'.confirmed.list

· Save the chimp reference sequence to:  'gene'.chimpContig.fasta

·  Generate the final chimp files:

chimpoutput.pl [-refcomp] [-lab] [-hugo]

where
-refcomp = 'gene'.chimp.refcomp.#.out
-lab = 'gene' name
-hugo = 'HUGO' name

· Copy the final chimp data files to: <gene>/stats_dir.

cp 'HUGO'.chimp* ../stats_dir/.

5. Follow the protocol for GenBank Submission.

6. Data transfer for posting on web.

· Make the directory on droog for the final data:

mkdir /droog/httpd/html/pga/data/'hugo' (use lower case this time)

· Copy all of the final data to the web server

cp /'drive'/'gene'/stats_dir/* /droog/httpd/html/pga/data/'hugo'

· Run finalize_web_data.pl (in /droog/httpd/html/pga/data/'hugo')

../finalize_web_data.pl -lab 'gene' -hugo 'HUGO' 

· Send email to pkeyes@mbt.washington.edu to notify him that the gene is finished.

· After an accession number has been assigned to your GenBank submission and the submission has been posted, rerun finalize_web_data.pl as below:

../finalize_web_data.pl -lab 'gene' -hugo 'HUGO' -genbank <accession number>


Detailed Procedure:

1. Verify your cDNA sequence.

All of the cDNA sequence confirmation should be done in the fasta_dir

· Verify the cDNA sequence ('gene'.cdna.fasta) against your final reference sequence ('gene'.refSeq.fasta).  Modify as needed to match reference sequence.

NOTE: If the alignment of the cDNA against your reference sequence generates “bad” alignments, you will need to edit the 'gene'.exons.xm file, remove these entries, and rerun <translate.pl>.

· Make the protein translation of your cDNA file.

translate -cdna 'gene'.cdna.fasta -pep 'gene'.protein.fasta

· Verify coordinates of start and stop codon against the LocusLink reference file.

· Copy 'gene'.cdna.fasta and 'gene'.protein.fasta to the 'gene'/stats_dir

Prior to annotation and generating the final data files, it is important to confirm the cDNA sequence. You should have either an NM_* or XM_* file downloaded from the LocusLink entry for the gene you are working on. When in doubt it is probably best to use the XM_* file.

To verify, align the reference sequence to the cDNA sequence:

cross_match ../edit_dir/'gene'.refSeq.fasta XM_* -alignments > outputfile

View the alignment looking for gaps or base differences. If these differences are found, check the latest consed assembly for sequencing/basecalling errors. If the differences are not errors, the cDNA sequence should be changed to reflect the same sequence as the reference sequence. Otherwise you can copy the LocusLink cdna file directly to the standard name:

cp XM* 'gene'.cdna.fasta

where XM_* should be the full name of the file!

NOTE: Some genes may have poor alignments (i.e. many mismatched bases) over large (100-200 bp) regions. Determine if these are spurious alignments. The "good" matching regions of the cDNA should be ordered across the cDNA sequence. If cross_match is generating some bad alignments, <translate.pl> may not run correctly and you may have to edit the 'gene'.exons.xm file.

Verify format of cDNA file. The file containing the cdna sequence should be named:

'gene'.cdna.fasta

and the first line (header) of this file must be of the format:

>'HUGO'.CDS

ACGATTTTAGGCTATA...

Check this and save the file.
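
The header check described above can be sketched as follows. The function name is hypothetical, and tolerating trailing whitespace on the header line is an assumption; the protocol only states that the first line must be >'HUGO'.CDS.

```python
def cdna_header_ok(fasta_text, hugo):
    """Return True if the first line of the FASTA text is exactly >HUGO.CDS.

    fasta_text: full contents of 'gene'.cdna.fasta as a string.
    hugo: the HUGO name expected in the header.
    """
    lines = fasta_text.splitlines()
    if not lines:
        return False
    # Trailing whitespace is ignored (an assumption); everything else must match.
    return lines[0].rstrip() == f">{hugo}.CDS"
```

A file copied straight from LocusLink will normally fail this check until its header line has been edited.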

Copy this file to the stats_dir:

cp 'gene'.cdna.fasta /'gene'/stats_dir/.

Make a protein translation of the cDNA sequence. A protein translation of this sequence will also be needed for the GenBank annotation. It can be generated using the program:

translate [-cdna] [-pep] <-startpos>

where:

-cdna = cDNA sequence you just created <gene>.cdna.fasta

-pep = output file of the amino acid translation

-startpos = user defined starting nucleotide to begin translation (optional)

The output -pep file should be named: 'gene'.protein.fasta

The output from "translate" will report:

A) the position of the start codon

B) the position of the stop codon

These positions should be consistent with the information in the LocusLink cDNA files (NM_* or XM_*). Review the annotation in these files ("CDS positions") and confirm that they are consistent with what is reported.
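
As an independent cross-check of the reported positions, the start and stop codons can also be located directly in the cDNA sequence. The sketch below is not the "translate" program and makes several assumptions: the function name is hypothetical, positions are 1-based, translation begins at the first ATG unless a start position is given, and the stop position is the last base of the stop codon (GenBank CDS ranges include the stop codon).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_cds(cdna, startpos=None):
    """Return (start, stop) as 1-based positions, or None if no ORF is found.

    cdna: the cDNA sequence as a string, without the FASTA header.
    startpos: optional 1-based position of the ATG to start from.
    """
    seq = cdna.upper()
    # Use the first ATG unless the caller supplies a start position.
    start = seq.find("ATG") if startpos is None else startpos - 1
    if start < 0 or seq[start:start + 3] != "ATG":
        return None
    # Walk codon by codon, in frame with the start codon.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start + 1, i + 3
    return None
```

For the toy sequence CCATGAAATTTTAGGG this returns (3, 14): the ATG begins at base 3 and the TAG stop codon ends at base 14.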

Copy this file to the stats_dir:

cp 'gene'.protein.fasta ../stats_dir/.

2. Move the final data files to the stats_dir

All of these steps should be performed in the <gene>/stats_dir

· Run finish_gene.pl, which automates the formatting of the output files. This program will give an error message if any files are not found or errors are generated.

NOTE: If you had to adjust the coordinates of the protein translation when making your 'gene'.protein.fasta file or if you have poor alignment entries in the 'gene'.exons.xm file you will need to rerun <translate.pl>. If you had to adjust the coordinates of the protein translation start site use the <-startpos> switch. If you had bad alignment entries, they must be edited out of the 'gene'.exons.xm file.

The process has been automated to a large degree and can be done by running:

finish_gene.pl [-lab] [-hugo] [-drive] [-project]

where -lab = <gene>

-hugo = <HUGO>

-drive = the hard drive where the data resides (e.g. C16)

-project = egp or pga

This program automates the running of multiple scripts:

copying of the final data file to the stats_dir

running <RepeatMasker> on the final reference sequence ('gene'.refSeq.fasta)

and the following perl scripts:

snpcontext.pl

drawmap.pl

translate.pl

This program will give an error message if any files are not found or errors are generated.

Note:

The final output from <translate.pl> will show the mapping of all the SNP sites to the cDNA.  It is important to verify that this was done correctly. The output from <translate.pl> will appear last and will report:

1) the position of the start codon

2) the position of the stop codon

These positions should be consistent with the information in the LocusLink cDNA files (NM_* or XM_*). Review the annotation in these files and confirm that the "CDS positions" are consistent with what is reported by translate.pl.


4. Verify mapping of the cSNPs

· Review the 'gene'.exons.xm file and verify that none of the cSNPs reported in 'gene'.csnps.txt map to intron/exon boundaries.


Review the 'gene'.exons.xm file and verify that none of the cSNPs reported in 'gene'.csnps.txt map to intron/exon boundaries, which may overlap between exons. Make sure that none of the cSNPs map near the intron/exon boundaries (within ~5 bp). Any sites mapping near a boundary need to be manually verified. This is an artifact of our mapping process.
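
The ~5 bp boundary check can be sketched as below. The formats of 'gene'.exons.xm and 'gene'.csnps.txt are not described in this protocol, so the function name is hypothetical and it assumes you have already extracted the cSNP positions and the exon start/end coordinates as integers in the same coordinate system.

```python
def csnps_near_boundaries(csnp_positions, exons, window=5):
    """Return the cSNP positions within `window` bp of any exon boundary.

    csnp_positions: iterable of integer positions.
    exons: iterable of (start, end) integer pairs, one per exon.
    window: distance in bp that counts as 'near' a boundary.
    """
    flagged = []
    for pos in csnp_positions:
        for start, end in exons:
            # A site is suspect if it is close to either end of an exon.
            if abs(pos - start) <= window or abs(pos - end) <= window:
                flagged.append(pos)
                break
    return flagged
```

Every position returned should be verified by hand, as described above.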

5. Analysis of chimpanzee data

· Copy the final human reference sequence to the chimp fasta_dir for comparison with chimp:

cp ../../stats_dir/'gene'.refSeq.fasta 'gene'/chimp_dir/fasta_dir

· Create the human reference sequence using mktrace. Move the human reference phd file to the phd_dir.

· Run <phredPhrap> on the chimpanzee data.

· Verify the assembly - do joins as necessary

· Run <refcomp> and send results to:

'gene'.chimp.refcomp.#.out

· Mark fixed difference sites with the polymorphismConfirmed tag.

· After confirming the sites, save the list of polymorphismConfirmed tags to:

'gene'.confirmed.list

· Save the chimp reference sequence to: 'gene'.chimpContig.fasta

· Generate the final chimp files:

chimpoutput.pl [-refcomp] [-lab] [-hugo]

where -refcomp = 'gene'.chimp.refcomp.#.out
-lab = 'gene'
-hugo = 'HUGO'

· Copy the final chimp data files to: <gene>/stats_dir.

cp 'HUGO'.chimp* ../../stats_dir/.


A) Copy human reference for comparison with chimp

cp ../../stats_dir/<gene>.refSeq.fasta ../chimp_dir/fasta_dir

B) Create human reference sequence (in <gene>/chimp_dir/fasta_dir)

run "mktrace"

At the prompts:

enter FASTA filename: 'gene'.refSeq.fasta

enter output filename: 'GENE'.FINAL.REF

This will create a file 'GENE'.FINAL.REF.phd.1. Copy this to the /phd_dir

cp 'GENE'.FINAL.REF.phd.1 ../phd_dir/.

C) Assemble the chimpanzee data

In the edit_dir run: <phredPhrap>

D) Run consed to view the assembly.

Join any contigs if necessary.

Verify the orientation of the consensus sequence (the arrow for the human reference sequence in the assembly should point left to right).

E) Run "refcomp" on the latest *.ace file.

refcomp -ace *.ace.# -quality 25 > 'gene'.chimp.refcomp.#.out

F) Using consed verify all differences between human and chimp.

On the main consed window, choose "Navigate > Custom Navigation" and select the "refcomp.nav" to check each polymorphic site in the list. Apply the "polymorphismConfirmed" tag to each real site.

G) After verifying all sites, create and save the list of confirmed sites.

Bring up the list of tags by using "Navigate > Tags" from the consed contig window, and choose the polymorphismConfirmed tag.

Save this list to the file:

'gene'.confirmed.list

H) Save the chimp contig consensus sequence while in the consed contig window as:

'gene'.chimpContig.fasta

I) Quit consed.

J) Generate the final chimp files

chimpoutput.pl [-refcomp] [-lab] [-hugo]

where -refcomp = 'gene'.chimp.refcomp.#.out
-lab = 'gene'
-hugo = 'HUGO'

This program outputs:

'HUGO'.chimpsites.txt

'HUGO'.refSeq.fasta

Copy these files to the stats_dir:

cp 'HUGO'* ../../stats_dir/.

6. Follow the protocol for GenBank Submission.

7. Data transfer for posting on web.

· Make the directory on droog for the final processed data:

mkdir /droog/httpd/html/pga/data/'hugo' (use lower case hugo name)

· Copy all of the final data in stats_dir to the web server

cp /'drive'/'gene'/stats_dir/* /droog/httpd/html/pga/data/'hugo'

· Run finalize_web_data.pl (in /droog/httpd/html/pga/data/'hugo')

../finalize_web_data.pl -lab 'gene' -hugo 'HUGO'

· Send email to pkeyes@mbt.washington.edu to notify him that the gene is finished.

· After an accession number has been assigned to your GenBank submission and the submission has been posted, rerun <finalize_web_data.pl> as:

../finalize_web_data.pl -lab 'gene' -hugo 'HUGO' -genbank <accession number>
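
The naming rules in the steps above (lower-case HUGO name for the web directory, optional -genbank switch on the rerun) can be sketched as two small helpers. The helper names are hypothetical; the path and switches are the ones given in this protocol, and the code only builds strings and does not run anything.

```python
def web_data_dir(hugo):
    """Return the web directory for a gene, using the lower-case HUGO name."""
    return f"/droog/httpd/html/pga/data/{hugo.lower()}"


def finalize_command(gene, hugo, accession=None):
    """Build the finalize_web_data.pl argument list as used in this protocol.

    accession: GenBank accession number, supplied only on the rerun after
        the submission has been posted.
    """
    cmd = ["../finalize_web_data.pl", "-lab", gene, "-hugo", hugo]
    if accession:
        cmd += ["-genbank", accession]
    return cmd
```

The first run omits the accession argument; the rerun passes the accession number from the GenBank confirmation email.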

A) cp 'gene'/stats_dir /droog/httpd/html/pga/data/'gene'_stats

B) run final_data.pl

../final_data.pl -lab 'gene' -hugo 'HUGO'

C) Inform the webmaster (pkeyes@mbt.washington.edu) that the completed gene should be moved from the "Genes in Progress" list to the "Finished Genes" on the PGA web page. 

D) Once the GenBank data is submitted, a confirmation of submission will be emailed to the person who sent in the submission. This confirmation will include an Accession Number, along with a date that the file will be posted to GenBank. Around that date, verify that the Accession Number is available on GenBank. Once this is available you can rerun <finalize_web_data.pl> to create the link to GenBank.

../finalize_web_data.pl -lab 'gene' -hugo 'HUGO' -genbank <accession number>

Revision History:

mjr 22-April-2001 Wrote initial protocol.

mjr 26-May-2001  Edited protocol to make it clearer.

mjr 19-June-2001  Made changes suggested by CP

nch 13-Dec-2002 Updated outdated commands and expanded descriptions