SNP_Identification_Protocol.doc

Analyst Protocol for SNP Identification

All transfers are done using the Macintosh computers attached to the sequencers. It is best to try to analyze all data files on the same day you transfer them. This helps to catch tracking errors.

1. CAREFULLY double-check tracking on the gel image. Pay close attention to tracking on the outermost lanes.

2. Using the FETCH program, log in to a workstation. A shortcut within FETCH should be available for each workstation. Valid workstations will usually be:
    1) haver.mbt.washington.edu
    2) amici.mbt.washington.edu
    3) saheli.mbt.washington.edu

3. Log in to each workstation with your username and password. For example:
    username: yourname
    password: ******

4. Navigate to the gene_name/new_chromats directory using FETCH. This can be done by double-clicking on the directory where you want to go within the FETCH transfer window.
For example: if you are working on the lpl gene, go to the directory lpl/new_chromats. This will usually have to be done by moving up one level from your login directory (using ".." within FETCH), then going into the lpl folder icon and then the new_chromats folder icon.
The standard directory structure for each gene is shown below:

    gene_dir
     |-- new_chromats
     |-- edit_dir
     |-- chromat_dir
     |-- phd_dir
     |-- poly_dir
     |-- analysis_dir
          |-- edit_dir
          |-- chromat_dir
          |-- phd_dir
          |-- poly_dir

5. Drag the files to be transferred into the FETCH transfer window. The transfer should begin immediately.

6. When the transfer is finished, quit the FETCH program.

==========

You can now log in to a workstation by using 1) an X-terminal, 2) a Macintosh with eXodus (X-terminal emulation software), or 3) a workstation directly.

7. At this point all work is done on the workstation. Again, log in using your own personal username and password.

8. Change to the gene_name/new_chromats directory. This can be done from your login directory by using the command:
    prompt>cd ../gene_name/new_chromats
Now type:
    prompt>moveChromats.pl
This program will move all the new data files you just transferred into the chromat_dir, renaming and compressing them as it does so.

9. Change to the gene_name/analysis_dir/edit_dir directory. When you are in the new_chromats directory, this can be done by typing:
    prompt>cd ../analysis_dir/edit_dir
**All your work must be done in this directory.**

10. To look at your current data files (from today), type:
    prompt>vp -project -days 0
Note: The -days switch counts backward from the current date. This is why -days 0 = today. This is the only command you should need to view data on the same day you transferred it. It will begin a series of programs (Phred, Phrap, PolyPhred, and Consed) which will allow you to view your data.

Other advanced options: You can also query data by strings within the filename. This is done using the -data switch.
Example:
    prompt> vp -project -data '010'
will get all files with the '010' string (i.e. all files from the first product/forward read).
    prompt> vp -project -data '010|011'
will get all files from product 01, both forward and reverse reads.
**This may be useful if you sequence a few samples of one product on one day and more samples on a subsequent day.**

==========

Work in this section will be done in the program Consed. This will allow you to view the assembly of all your chromatograms and make editing changes.

11. After running 'vp', Consed should automatically open a window asking you to open an ".ace" assembly file. There should be only a few choices. Generally you should choose the file:
    gene_name.fasta.screen.ace.1
This should contain the most recent data you just transferred.

12. A window should come up next listing the "contigs" in this ".ace" file that contain chromatograms for your most recent work.
One contig should be longer than all the others (equal to the length of the gene you are sequencing) and should contain most of the chromatograms you just transferred. Select that contig.
**If you feel your assembly is incorrect -- i.e. many contigs, or many mismatching bases within a contig -- consult with someone else for help.**
If you have a few contigs (3-5), look at each to check the quality of the reads. If contigs have low quality reads and don't assemble with the reference sequence contig, generally ignore them. They are probably failed reactions and will come out in your cleanup report. If you have high quality reads, try to join them with the reference sequence contig (see below).

Joining Contigs

13. If you have more than one contig (other than the contig containing the reference sequence), you can try to combine the other contigs with your reference sequence contig.
    A. View the contig you want to join with the contig containing the reference sequence.
    B. Select a sequence near the beginning or end of your contig that has good sequence quality (i.e. areas highlighted in white).
    C. Go to the main Consed window and select "Search for String".
    D. Type in (or copy) the sequence you would like to find. Select "OK".
       * Note: to easily copy a sequence, swipe it with your cursor to highlight it, then go to the area where you would like to paste it and click M2.
    E. A results window will pop up, hopefully with one result from the contig containing the reference sequence and one from the contig you are currently using. Both results should say "uncomplemented". If they don't, you need to use the "Comp Contig" button to reverse one or both of the contigs.
    F. Click on the first selection and it will position you in the consensus sequence of that contig.
    G. Go back to the search results window and click on the next selection. Again you should be positioned in the consensus sequence of the second contig.
       ** It is important to be in the consensus sequence of each contig **
    H. Select "Compare Contig" from the first contig window. A new window should pop up showing your sequence.
    I. Select "Compare Contig" from the second contig window. This sequence should appear below the first sequence and nearly aligned to it. Click on the "Align" button.
    J. Your contigs should now be aligned (possibly with some mismatches shown as "X"). If you scroll in this window, most bases should align perfectly.
    K. In the alignment window select "Join contigs". After a second a new contig window should appear, numbered +1 from your highest numbered contig (e.g. if you have 2 contigs, Contig1 and Contig2, the new joined contig should be numbered Contig3, and Contig1 and Contig2 should have disappeared).
    L. This process can be repeated for each individual contig until they are all aligned with the reference sequence contig.
    M. At this point verify again the orientation and consistency of the new reads placed into this contig (see below).
** If you do any "joining" of contigs you will have to save the new assembly and quit Consed before you move on to "SNP Identification and Genotype Verification" (below). **

YOUR FIRST OBJECTIVE IS TO CONFIRM THE VALIDITY AND CONSISTENCY OF THE ASSEMBLY CONTAINING YOUR MOST RECENT CHROMATOGRAMS.

14. Make sure that all of your sequencing reads are in the correct orientation. The reference sequence should be directed left to right (the "forward" orientation), i.e. the arrow showing the orientation of that read should point left to right. If this is not the case, use the "Comp Contig" button to reverse the orientation of the contig. Compare the "primer" designation from the read name against the orientation of the reference sequence. All "forward" reads should be oriented left to right, while all reverse reads should be oriented right to left (i.e. the direction of the arrow).

15. Make sure that all of your sequencing reads are assembling at the correct position in the reference sequence and in accordance with the respective primer.
Check the "tags" within the reference sequence. This is done by scrolling in the contig window until you see highlighted regions (about 20 bp) in the reference sequence. Clicking with the 3rd mouse button (M3) brings up a menu. Select "show tag details". Verify your sequencing read against that primer number.

16. If your contig passes these two checks (Steps 14 and 15), move on to "SNP Identification and Genotype Verification". Otherwise, try to figure out why some reads may have assembled incorrectly -- Is it due to lane tracking? Sample loading problems? Sample handling mix-ups? **DO NOT** do any more analysis until this has been sorted out. If you remove/replace any files due to errors you will have to rerun 'vp' (i.e. Phred, Phrap, PolyPhred).

==========

SNP Identification and Genotype Verification

Now that you have verified the validity and consistency of your contigs, you can go through each column marked by PolyPhred to verify its polymorphism identification.

If you did any joining of contigs you must rerun PolyPhred. If you did a join and accepted the filename suggested by Consed, you should have the filename shown below (gene_name.fasta.screen.ace.2).
To run PolyPhred:
    prompt> polyphred -ace gene_name.fasta.screen.ace.2 > gene_name.polyphred.out
    prompt> update_tags_and_ace.pl -ace gene_name.fasta.screen.ace.2
If you did a join of your contigs and/or reran PolyPhred, you MUST use the program consed_edit to do your editing (otherwise your work will not be saved). Simply type:
    prompt> consed_edit
If you didn't do any joining, you can simply work in the Consed session started by the program "vp" (it will save your data automatically).

1. In order to view polymorphic sites tagged by Consed you need to open the contig assembly window. Note the red, orange, and green tags which appear on the consensus sequence. These are rank tags applied by PolyPhred (you may need to move around in the window).
The tags appearing on the aligned reads below the consensus should be either purple (homozygotes) or pink (heterozygotes). Your job is to look at each column marked as a polymorphism and accept or reject it.

2. You can navigate to each tag by moving the scroll bar at the bottom of the window or by using the arrows (>, >> or <, <<) at the bottom of the window.

3. To evaluate a column, choose a read marked as heterozygous (pink) and another marked as homozygous (purple) and compare them. To select a read, click with mouse button 2 (M2) on the read at the position where you want to view the chromatogram.

4. When comparing chromatograms between putative homozygotes and heterozygotes, you are looking for two characteristics:
    1) a drop of 50% in the peak height of the heterozygote
    2) the appearance of a "strong" secondary peak in the heterozygote.
You should usually check a minimum of two heterozygotes to verify that a site is a "real" polymorphic site. To add chromatograms to your window, use the M2 button to select that read. It should then appear aligned with the other chromatograms.

5. Another easy case is a column with two homozygotes which have different bases at the same position. One of these bases should also look red. This is a sign that it is mismatched from the consensus sequence. You should also compare these homozygote mismatches against each other. Ideally, it is nice to compare both types of homozygotes against a heterozygote, though this may not always be possible, depending on the genotypes of the individuals you are viewing.

6. Once you have built up enough confidence that the site is correctly marked as a real polymorphism by PolyPhred, you have to "tag" that column. You only need to tag one of the reads in the column. Select any of the reads at the polymorphic site using M2. This should bring up the chromatogram. At the site of the polymorphism (i.e. on the purple or pink tag) click M2 again.
This will bring up a menu for applying a tag. Choose "Add Tag". A window with valid tags will be shown. Choose "realPolymorphism". This tag will automatically be applied and should now appear dark purple at the site where the heterozygote (pink) or homozygote (purple) tag once appeared. Remember, you only need to mark one read at a polymorphic site with the "realPolymorphism" tag.

7. If you have looked at a site and tried to verify it by doing multiple comparisons and still can't accept or reject it, you can put a "comment" tag on a read. This is done the same way as applying the "realPolymorphism" tag, except that you will be presented with a dialogue box in which to write a comment. This comment can be reviewed by someone else at a later date.

8. Once you have verified a site as real and applied the "realPolymorphism" tag, you need to check every read to verify the genotype which was applied. One easy shortcut for doing this is to bring up all traces in a particular column and just run down that site in all individuals.

9. Using M3, click on the consensus sequence at the position you are interested in (it should have a red, green, or orange tag). A menu will pop up; select the "Display traces for all reads" item. All the chromatograms should be displayed and you can scroll down looking at each one.

10. If you see any sites where you think a heterozygote was called a homozygote (or vice versa), you can edit the genotype tag for that read. This is done the same way as applying a "realPolymorphism" or comment tag. That is, click with M2 on the tagged sequence and bring up the tags menu. Select "Add Tag". This time, however, select the appropriate genotype tag. You should see selections for all combinations of homozygote and heterozygote genotype tags (i.e. homozygoteAA, homozygoteTT, ..., heterozygoteAC, heterozygoteAG, ...). Select the correct tag for the genotype you would like to correct. It should then appear on the chromatogram as a dark purple tag.

11. Once you are finished checking all sites you can quit Consed (in the main window). You will be prompted to save your file (if you made any changes) -- choose "Save before quitting". Accept the default filename Consed gives to your assembly.
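
Note on the genotype tags in step 10: the tag list offers one homozygote tag per base and one heterozygote tag per unordered pair of bases. The short Python sketch below enumerates the names that this pattern implies, as a quick reference when looking for the right tag. It is illustrative only. The function name genotype_tags and the alphabetical ordering of the two bases in heterozygote names are assumptions inferred from the examples in step 10 (homozygoteAA, heterozygoteAC, heterozygoteAG); they are not taken from Consed or PolyPhred, whose actual tag list may differ.

```python
from itertools import combinations

BASES = "ACGT"


def genotype_tags(bases=BASES):
    """Return the genotype tag names implied by the pattern in step 10.

    Assumption: heterozygote names list their two bases in alphabetical
    order (e.g. heterozygoteAG, never heterozygoteGA).
    """
    # One homozygote tag per base: homozygoteAA, homozygoteCC, ...
    homozygotes = [f"homozygote{b}{b}" for b in bases]
    # One heterozygote tag per unordered pair of distinct bases.
    heterozygotes = [f"heterozygote{a}{b}" for a, b in combinations(bases, 2)]
    return homozygotes + heterozygotes


if __name__ == "__main__":
    for tag in genotype_tags():
        print(tag)
```

With four bases this gives ten names: four homozygote tags and six heterozygote tags.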