Data Submission to GenBank

 

Overview:

All single nucleotide polymorphic sites submitted to dbSNP (see protocol) require a GenBank accession number. During the SNP discovery phase we sequence multiple individuals across contiguous segments of genomic DNA, and this in-depth sequencing usually yields a near "base-perfect" sequence. Every SNP identified can be linked directly to a coordinate on the final reference sequence. In addition, this sequence may serve as a locus-specific (gene-specific) resource on which individual investigators can anchor multiple sequence features.

 

The submitted reference sequence will detail all known SNP positions and contain either 1) the "longest" polymorphic unit at each position (i.e., the inserted form rather than the "deleted form") or 2) the most "common" allele at each polymorphic position (i.e., the most frequent allele in our sampled population). Repetitive sequence elements and protein-coding regions will also be annotated. This protocol uses the Sequin GenBank submission program and its accompanying command-line program, tbl2asn, to semi-automatically generate a GenBank submission file from our standard output files.
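
As a rough illustration of the allele rule described above, the following shell sketch picks the reference form for a single site. The input layout (one "allele frequency" pair per line, with "-" standing for the deleted form) and the function name pick_reference_allele are assumptions made for this example only; they do not describe the actual layout of <gene>.freq.txt.

```shell
# Hypothetical input on stdin: one "allele frequency" pair per line for ONE site.
# "-" stands for the deleted form. Prints the allele to place in the reference.
pick_reference_allele() {
  awk '
    {
      allele = ($1 == "-") ? "" : $1
      # Any allele that is not exactly one base marks the site as an indel.
      if (length(allele) != 1) indel = 1
      if (NR == 1 || length(allele) > maxlen) { maxlen = length(allele); longest = $1 }
      if (NR == 1 || $2 + 0 > maxfreq)        { maxfreq = $2 + 0; common = $1 }
    }
    # Indel sites use the longest form; all other sites use the most common allele.
    END { if (NR > 0) print (indel ? longest : common) }
  '
}
```

For a substitution with alleles "A 0.8" and "G 0.2" the sketch prints A; for an indel with "- 0.7" and "AT 0.3" it prints AT, the inserted form, even though it is the rarer one.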

 

 

Related Web sites:

 

http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm

A quick tutorial on how to use the Sequin submission tool.

 

ftp://ftp.ncbi.nlm.nih.gov/sequin/

Download area containing most recent version of Sequin.

 

http://www.ncbi.nlm.nih.gov/Sequin/table.html

Information on how to format data input files for tbl2asn.

 

 

Required Files:

/usr/local/genome/src/sequin_*/pga_template.sbt

<gene>.cdna.fasta

<gene>.protein.fasta

<gene>.repeats.xm

<gene>.freq.txt

<gene>.exons.xm

<hugo_name>.fsa

NM_*.cdna.fasta or

XM_*.cdna.fasta (for the current gene - usually in the "<gene>/fasta_dir")

NP_*.protein.fasta or

XP_*.protein.fasta (for the current gene - usually in the "<gene>/fasta_dir/")

 

 

Required Programs:

tbl2asn

sequin

make_genbanktable.pl

translate

 

 

Abbreviated Procedure:

 

 

All these steps are performed in the <gene>/stats_dir

 

1.      Create a feature table for your GenBank file using make_genbanktable.pl

 

make_genbanktable.pl -xm_repeat <gene>.repeats.xm -frequency <gene>.freq.txt -hugo <HUGO>

 

2.      Use tbl2asn to create your GenBank file.

 

tbl2asn -t /usr/local/genome/src/sequin_*/pga_template.sbt -p . -v

 

Output from this program is:  <HUGO>.sqn

 

3.      Verify your cDNA sequences.

 

This should already have been done during the final check of the gene. If not, follow the steps below:

 

All of the cDNA sequence confirmation should be done in the <gene>/fasta_dir

 

·        Verify the cDNA sequence (<gene>.cdna.fasta) against your final reference sequence (<gene>.refSeq.fasta).  Modify as needed to match reference sequence.

 

4.      View and annotate your preliminary sequence file (<HUGO>.sqn) using the sequin program.

·        Annotate the gene boundaries

 

o       Under the "Annotate" menu select "Genes and Name Regions > Gene".

o       Under the "Gene/General" tab enter the HUGO name into the “Locus” field

o       Select the "Location" tab and enter the beginning (5') and ending (3') coordinates of the gene's mRNA sequence. (see <gene>.exons.xm)

 

·        Annotate the coding regions

 

o        Under the "Annotate" menu select "Coding Regions and Transcripts > CDS".

o        From the "File" menu select "Import Protein FASTA" (<gene>.protein.fasta)

o       Select the "Predict Interval" button. Select the "Location" tab and verify that the exon intervals were entered for this protein sequence.

o       From the "Coding Region/Protein" subtab enter the full description of the gene from LocusLink in the "Name" field.

o       Annotate alternatively spliced forms of the gene.

 

·        Annotate the mRNA sequence

o       Under the "Annotate" menu select "Coding Regions and Transcripts > mRNA".

o       Under the “mRNA” tab enter the <HUGO> in the "Name" field

o        From the "File" menu select "Import cDNA FASTA". (<gene>.cdna.fasta)

o       Select the "Location" tab and verify that the mRNA intervals were entered for this cDNA sequence.

 

·        Annotate other comments.

o       One common reason for annotation would be to identify regions we have not resequenced completely. The comment "Region not scanned for variation" should be entered.

o       Under the "Annotate" menu select "Bibliographic and Comments > Comment"

o       Select the "Properties/Comment" subtab.

o       Select the "Location" tab and enter the beginning and ending coordinates over which this comment applies.

 

·        Generate the definition line and validate

o       Under the “Annotate” menu select “Generate Definition”

o       When finished annotating select the “Done” button on the main window.  This will perform an automatic validation of your file, checking for inconsistencies or bad formatting.

 

 

 

5.      Save the Sequin file

·        Save this file as <HUGO>.genbank.sqn.

·        Export a GenBank file as <HUGO>.genbank.txt

·        E-mail the <HUGO>.genbank.sqn file to gb-sub@ncbi.nlm.nih.gov and riley@ncbi.nlm.nih.gov

 

 

 

 

Detailed Procedure:

 

1.      Create a feature table for your GenBank file using make_genbanktable.pl

 

make_genbanktable.pl -xm_repeat <gene>.repeats.xm -frequency <gene>.freq.txt -hugo <HUGO>

 

 

This is a Perl script that parses and reformats our standard data files. It creates a "feature table" to be used in the creation of the GenBank file. The repetitive elements are parsed from the <gene>.repeats.xm file and variation information is parsed from the <gene>.freq.txt file. In the GenBank output file each variation is annotated with the alternative form at that site (the deletion-type polymorphism or the rare allele) relative to our standard reference sequence (see Overview). The output file from this program is <hugo_name>.tbl. This table file must reside in the same directory as the final reference sequence <hugo_name>.fsa (see below).
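
For orientation, the sketch below prints a small table in the tab-delimited five-column layout that tbl2asn reads (see the table-format link under Related Web sites). The coordinates, the allele, the repeat family and the function name print_example_table are invented for illustration; the exact feature keys and qualifiers that make_genbanktable.pl writes are not shown in this protocol, so treat "variation"/"replace" and "repeat_region"/"rpt_family" as assumptions.

```shell
# Print a minimal five-column feature table for the sequence ID given as $1.
# All coordinates and values below are made up for illustration.
print_example_table() {
  seq_id=$1
  # Header line naming the sequence the features belong to.
  printf '>Feature %s\n' "$seq_id"
  # Feature line: start, stop, feature key (tab-separated).
  printf '%s\t%s\t%s\n' 1200 1200 variation
  # Qualifier line: three empty columns, then qualifier key and value.
  printf '\t\t\t%s\t%s\n' replace g
  printf '%s\t%s\t%s\n' 3050 3349 repeat_region
  printf '\t\t\t%s\t%s\n' rpt_family Alu
}
```

Running print_example_table F2R > F2R.tbl would give a file that can be compared by eye with the real <hugo_name>.tbl when a table fails validation.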

 

2.      Use tbl2asn to create your GenBank file.

 

tbl2asn -t /usr/local/genome/src/sequin_*/pga_template.sbt -p . -v

 

Output from this program is:  <HUGO>.sqn

 

"tbl2asn" is a command-line program that generates your GenBank file. It requires a template file containing standard submission information (see Required Files), the gene feature table <hugo_name>.tbl, and a FASTA file of the gene sequence you wish to submit, <hugo_name>.fsa. The table file and the FASTA file must have the same basename (<hugo_name>).

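Because tbl2asn pairs the table and FASTA files by basename, it can help to confirm that both are present before running it. The following is a minimal sketch; the function name check_tbl2asn_inputs is hypothetical and not part of Sequin or tbl2asn.

```shell
# Succeed only if both <hugo>.tbl and <hugo>.fsa exist in directory $1.
# $1 = directory to check, $2 = HUGO name used as the shared basename.
check_tbl2asn_inputs() {
  dir=$1
  hugo=$2
  status=0
  for ext in tbl fsa; do
    if [ ! -f "$dir/$hugo.$ext" ]; then
      # Report each missing file on stderr and remember the failure.
      printf 'missing: %s/%s.%s\n' "$dir" "$hugo" "$ext" >&2
      status=1
    fi
  done
  return "$status"
}
```

Typical use from the stats_dir would be: check_tbl2asn_inputs . <HUGO> && tbl2asn -t /usr/local/genome/src/sequin_*/pga_template.sbt -p . -v
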
 

Command line parameters for tbl2asn are:

 

tbl2asn -t [template_file] -p [path_to_input_files] -v

 

-t = template file = /usr/local/genome/src/sequin_*/pga_template.sbt

 

where the sequin_* can be sequin_linux or sequin_solaris depending on your machine.

 

-p = path to the table and FASTA files; -p . is the current directory

-v = performs a validation

 

Use the -v switch to perform validation on your sequence. An error message is reported if something is not formatted correctly.

 

This creates a file named <HUGO>.sqn.
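
The choice between the sequin_linux and sequin_solaris template directories can be made from the operating-system name. The sketch below assumes that "uname -s" prints Linux or SunOS on our machines; the function name template_for_os is hypothetical.

```shell
# Map an operating-system name (as printed by `uname -s`) to the template path.
template_for_os() {
  case $1 in
    Linux) printf '%s\n' /usr/local/genome/src/sequin_linux/pga_template.sbt ;;
    SunOS) printf '%s\n' /usr/local/genome/src/sequin_solaris/pga_template.sbt ;;
    *)     printf 'no Sequin template known for %s\n' "$1" >&2
           return 1 ;;
  esac
}
```

It could then be used as: tbl2asn -t "$(template_for_os "$(uname -s)")" -p . -v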

 

 

3.      Verify your cDNA sequences.

 

This should already have been done during the final check of the gene. If not, follow the steps below:

 

All of the cDNA sequence confirmation should be done in the <gene>/fasta_dir

 

·        Verify the cDNA sequence (<gene>.cdna.fasta) against your final reference sequence (<gene>.refSeq.fasta).  Modify as needed to match reference sequence.

 

 


 

Before annotating and generating the final data files, it is important to confirm the cDNA sequence. You should have either an NM_* or an XM_* file downloaded from the LocusLink entry for the gene you are working on. When in doubt it is probably best to use the XM_* file.

 

To verify, align the reference sequence to the cDNA sequence:

 

cross_match ../edit_dir/<gene>.refSeq.fasta XM_* -alignments > somefile

 

View the alignment, looking for gaps or base differences. If differences are found, check the latest consed assembly for sequencing or base-calling errors. If the differences are not errors, change the cDNA sequence so that it matches the reference sequence. If there are no differences, you can copy the LocusLink cDNA file directly to the standard name:

 

cp XM_* <gene>.cdna.fasta

 

where XM_* should be replaced by the full name of the file!

 

NOTE:  Some genes may have poor alignments (i.e., many mismatched bases) over large (100-200 bp) regions.  Determine whether these are spurious alignments.  The "good" matching regions should appear in order along the cDNA sequence.  If cross_match is generating some bad alignments, you may have to edit the <gene>.exons.xm file, and pga_translate may not run correctly.

 

 

Verify the format of the cDNA file. The file containing the cDNA sequence should be named:

 

<gene>.cdna.fasta

 

and the first line (header) of this file must be of the format:

 

><HUGO>.CDS

ACGATTTTAA (etc)

 

Check this and save the file.
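
The header check can also be done from the command line. This is a minimal sketch that compares the first line of the file with the expected ><HUGO>.CDS form; the function name check_cdna_header is hypothetical.

```shell
# Succeed if the first line of FASTA file $1 is exactly ">$2.CDS".
# $1 = path to <gene>.cdna.fasta, $2 = HUGO name.
check_cdna_header() {
  fasta=$1
  hugo=$2
  header=$(head -n 1 "$fasta")
  if [ "$header" = ">$hugo.CDS" ]; then
    return 0
  fi
  # Show the offending header so it can be corrected in an editor.
  printf 'unexpected header in %s: %s\n' "$fasta" "$header" >&2
  return 1
}
```

For example, check_cdna_header <gene>.cdna.fasta <HUGO> returns a non-zero status and prints the header it found when the file still carries the original NM_* or XM_* description line.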

 

Copy this file to the stats_dir:

 

cp <gene>.cdna.fasta ../stats_dir/.

 

Make a protein translation of the cDNA sequence; it is also needed for the GenBank annotation. It can be generated using the program:

 

translate [-cdna] [-pep] <-startpos>

 

where:

-cdna = cDNA sequence you just created <gene>.cdna.fasta

-pep = output file of the amino acid translation

-startpos = user defined starting nucleotide to begin translation (optional)

 

 

The output -pep file should be named: <gene>.protein.fasta

 

The output from "translate" will report:

A) the position of the start codon

B) the position of the stop codon

 

These positions should be consistent with the information in the LocusLink cDNA files (NM_* or XM_*). Review the annotation in these files ("CDS positions") and confirm that it is consistent with what is reported.
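
As an additional sanity check on the reported positions, the sketch below confirms that a start codon and an in-frame stop codon are actually present at the given coordinates. It assumes that both positions are 1-based and point at the first base of each codon, which may differ from the convention used by "translate"; the function name check_cds_codons is hypothetical.

```shell
# $1 = cDNA sequence (no header, no line breaks), $2 = start codon position,
# $3 = stop codon position. Positions are assumed to be 1-based and to point
# at the first base of each codon.
check_cds_codons() {
  seq=$(printf '%s' "$1" | tr 'acgt' 'ACGT')
  start=$2
  stop=$3
  start_codon=$(printf '%s' "$seq" | cut -c "$start-$((start + 2))")
  stop_codon=$(printf '%s' "$seq" | cut -c "$stop-$((stop + 2))")
  [ "$start_codon" = "ATG" ] || { echo "no ATG at $start (found $start_codon)" >&2; return 1; }
  case $stop_codon in
    TAA|TAG|TGA) ;;
    *) echo "no stop codon at $stop (found $stop_codon)" >&2; return 1 ;;
  esac
  # The stop codon must be in the same reading frame as the start codon.
  [ $(( (stop - start) % 3 )) -eq 0 ] || { echo "stop codon is out of frame" >&2; return 1; }
}
```

The sequence can be taken from the FASTA file with, for example, seq=$(grep -v '^>' <gene>.cdna.fasta | tr -d '\n') before calling check_cds_codons "$seq" <start> <stop>.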

 

Copy this file to the stats_dir:

 

cp <gene>.protein.fasta ../stats_dir/.

 

 

 

 

4.      View your preliminary sequence file (<HUGO>.sqn) using the sequin program.

·        Annotate the gene boundaries

 

o       Under the "Annotate" menu select "Genes and Name Regions > Gene".

o       Under the "Gene/General" tab enter the HUGO name into the “Locus” field

o       Select the "Location" tab and enter the beginning (5') and ending (3') coordinates of the gene's mRNA sequence. (see <gene>.exons.xm)

 

·        Annotate the coding regions

 

o        Under the "Annotate" menu select "Coding Regions and Transcripts > CDS".

o        From the "File" menu select "Import Protein FASTA" (<gene>.protein.fasta)

o       Select the "Predict Interval" button. Select the "Location" tab and verify that the exon intervals were entered for this protein sequence.

o       From the "Coding Region/Protein" subtab enter the full description of the gene from LocusLink in the "Name" field.

o       Annotate alternatively spliced forms of the gene.

 

·        Annotate the mRNA sequence

o       Under the "Annotate" menu select "Coding Regions and Transcripts > mRNA".

o       Under the “mRNA” tab enter the <HUGO> in the "Name" field

o        From the "File" menu select "Import cDNA FASTA". (<gene>.cdna.fasta)

o       Select the "Location" tab and verify that the mRNA intervals were entered for this cDNA sequence.

 

·        Annotate other comments.

o       One common reason for annotation would be to identify regions we have not resequenced completely. The comment "Region not scanned for variation" should be entered.

o       Under the "Annotate" menu select "Bibliographic and Comments > Comment"

o       Select the "Properties/Comment" subtab.

o       Select the "Location" tab and enter the beginning and ending coordinates over which this comment applies.

 

·        Generate the definition line and validate

o       Under the “Annotate” menu select “Generate Definition”

o       When finished annotating select the “Done” button on the main window.  This will perform an automatic validation of your file, checking for inconsistencies or bad formatting.

 

 

Start the sequin program by typing:

>sequin

 

On the opening screen select "Read Existing Record". You should now see your file appear in GenBank format with annotation of variations and repeats.

 

Some additional annotation of the GenBank file is required prior to submission.

 

Annotate the Gene

 

Under the "Annotate" menu select "Genes and Name Regions > Gene". In the next window under the "Gene/General" tab enter the <HUGO> into the Locus field. Next select the "Location" tab and enter the beginning (5') and ending (3') coordinates of the gene's mRNA sequence. The coordinates can be found in the <gene>.exons.xm file. Close this window by choosing "Accept". Annotation for the gene should appear in the Features section of the GenBank file.

 

Annotate the Coding Regions

 

Under the "Annotate" menu select "Coding Regions and Transcripts > CDS".  In the next window, from the "File" menu select "Import Protein FASTA". Select the file containing the amino acid translation for this gene (<gene>.protein.fasta). The translation should appear in the current window. Next, select the "Predict Interval" button. Select the "Location" tab and verify that the exon intervals were entered for this protein sequence. Finally, in the "Coding Region/Protein" subtab enter the full description of the gene from LocusLink (Homo sapiens Official Gene Symbol and Name) in the "Name" field (e.g.  F2R: factor 2 receptor).

 

If a gene is alternatively spliced, annotate each form for the coding region and mRNA (below).  Alternatively spliced genes will have multiple entries in LocusLink for their cDNA RefSeqs.

 

Annotate the mRNA Sequence

 

The procedure is similar to annotating the Coding Regions above.

 

Under the "Annotate" menu select "Coding Regions and Transcripts > mRNA". In the next window from the "File" menu select "Import cDNA FASTA". Select the file containing the mRNA sequence for this gene (<gene>.cdna.fasta). Select the "Location" tab and verify that the mRNA intervals were entered for this cDNA sequence. Under the mRNA tab enter the <hugo_name> in the "Name" field.

 

Check the coordinates for the mRNA “join” and CDS “join” annotation.  These intron/exon boundary numbers should match at all splice junctions with the exception of either 5’ or 3’ UTR sequence.  If there are any discrepancies between the mRNA and CDS, the CDS coordinates should be used.
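
The comparison of the two "join" annotations can be sketched in shell as follows. The sketch assumes simple locations of the form join(a..b,c..d,...) copied from the exported GenBank flat file, without complement() or partial (< or >) markers; the function names join_coords and cds_matches_mrna are hypothetical.

```shell
# Print every coordinate of a join(...) location, one per line.
join_coords() {
  printf '%s\n' "$1" |
    sed -e 's/^join(//' -e 's/)$//' -e 's/\.\./,/g' |
    tr ',' '\n'
}

# $1 = mRNA location, $2 = CDS location.
# Succeed if every internal CDS boundary also occurs in the mRNA location.
# The first and last CDS coordinates are skipped because they border UTR.
cds_matches_mrna() {
  mrna_coords=$(join_coords "$1")
  missing=$(join_coords "$2" | sed '1d;$d' | while read -r pos; do
    printf '%s\n' "$mrna_coords" | grep -qx "$pos" || printf '%s ' "$pos"
  done)
  if [ -n "$missing" ]; then
    printf 'CDS boundaries not found in mRNA: %s\n' "$missing" >&2
    return 1
  fi
}
```

For example, cds_matches_mrna 'join(1..100,200..300,400..500)' 'join(50..100,200..300,400..450)' succeeds, while changing 200 to 201 in the CDS reports that coordinate as a discrepancy to resolve in favor of the CDS.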

 

 

Annotate other comments

 

If needed you can add comments to this file. Under the "Annotate" menu select "Bibliographic and Comments > Comment". In the next screen select the "Properties/Comment" subtab. Enter your free text annotation in the Comment box.

 

One common reason for annotation would be to identify regions we have not resequenced completely. The comment "Region not scanned for variation" should be entered.

 

Finally, select the "Location" tab and enter the beginning and ending coordinates over which this comment applies.

 

5.      Save the Sequin file

·        Save this file as <HUGO>.genbank.sqn.

·        Export a GenBank file as <HUGO>.genbank.txt

·        E-mail the <HUGO>.genbank.sqn file to gb-sub@ncbi.nlm.nih.gov and riley@ncbi.nlm.nih.gov

 

Save the GenBank file

 

Save this file as <HUGO>.genbank.sqn.

 

Also save a flat-file text version of this file by using the "Export GenBank" under the "File" menu. Save this file as <HUGO>.genbank.txt.

 

Submit your GenBank file.

 

When using Sequin, the output file (<HUGO>.genbank.sqn) can be directly submitted to GenBank by electronic mail at: gb-sub@ncbi.nlm.nih.gov

 

We have a personal contact at NCBI who will expedite our submissions. On the Subject line of the e-mail put "Attn: Leigh Riley" and cc: the submission to her at:

 

riley@ncbi.nlm.nih.gov

 

You should receive an immediate automatic e-mail reply confirming that GenBank has received your submission. Within 5-7 working days an accession number and flat file will be mailed to you for review. This accession number can be used in your dbSNP submission (see protocol).

 

 

Revision History:

mjr 11-Feb-2001 Developed and wrote initial protocol.

mjr 07-March-2001 Revised protocol to reflect filenaming convention and with more detail.

mjr 29-March-2001 Revised protocol for checking cdna sequence and translation

mjr 26-May-2001 Edited protocol for clarity.

mjr 20-June-2001  Made changes suggested by CP.