To Send Data to Michigan:


DATA SUBMITTAL TO MICHIGAN:

We send: sequence data, OLA data, a reference sequence, a chimp reference sequence, and an "exceptions" file.

SEQUENCE DATA:

For each gene, we need a genotype for each sample at every polymorphism. This includes insertions and deletions. Any polymorphism that is not a SNP is given an exception code and is genotyped according to that code. See the MDECODE pamphlet for how to name these, plus you can refer to Scott's old work. Also, see below.

FORMAT: The files we send are called "source" files and are basically a modified prettybase. Source files should list sample, site, allele1, and allele2, separated by commas.
EX: RM62,ae00567,A,T

*The samples must be MDECODE ID numbers. This means that the identifier we use to sequence must be mapped back to the MDECODE ID number. For example, RM62 might correspond to "our" R06. Scott has old .sql files around that can be used to do this.
*The site (00567 in this case) is preceded by the two-letter MDECODE gene code. See the pamphlet for each gene's abbreviation.
*Alleles are listed in alphabetical order.
*All conflicts (XX genotypes) must be changed to NNs before submission.

The source file should be saved as something like "lp022200.source". We include the two-letter MDECODE gene name, the date, and the ".source" suffix.

OLA DATA:

For each gene, we need a genotype for each sample at every polymorphism. This includes insertions and deletions. Any polymorphism that is not a SNP is given an exception code and is genotyped according to that code. See the MDECODE pamphlet for how to name these, plus you can refer to Scott's old work.

Sample IDs are recorded on blocks (see folder "DNA blocks" or Barney) in terms of their Kottke ID numbers. We use the Kottke ID numbers to report the data, so you do not have to change the sample IDs in this case.

FORMAT: The files we send are called "source" files and are basically a modified prettybase. Source files should list sample, site, allele1, and allele2, separated by commas.
EX: 1262,ae00567,A,T

*The samples must be Kottke ID numbers.
*The site (00567 in this case) is preceded by the two-letter MDECODE gene code. See the pamphlet for each gene's abbreviation.
*Alleles are listed in alphabetical order.
*All conflicts (XX genotypes) must be changed to NNs before submission.

The source file should be saved as something like "lp022200.NK.source". We include the two-letter MDECODE gene name, the date, and the ".source" suffix. In the case of OLA, we submit the populations separately, and therefore name the file accordingly.

REFERENCE SEQUENCE:

This sequence is constructed from OUR consensus sequence. It should not be confused with the GenBank reference sequence. Our sites must be numbered according to this sequence. See Debbie for help with determining the cutoff points. Also, at any polymorphic site, our sequence should contain the most common allele. It must be submitted in FASTA format and named "lp022200.reference".

CHIMP REFERENCE SEQUENCE:

This sequence is constructed from OUR chimp consensus sequence. See Debbie for help with determining the cutoff points. It must be submitted in FASTA format and named "lp022200.chimp.reference".

EXCEPTIONS FILE:

This file must list indels and other odd polymorphic pieces of DNA. Each allele is given an exception code and sites are genotyped according to this code. As noted below, these codes are not always easy to figure out. The exceptions will be given a four-character code: E001 to E999.

EX: lp00106,E001,106-107insGGC
    lp00106,E002,106-107insGGCGCC

The first example denotes an insertion of GGC between 106 and 107 of the reference sequence. The second example denotes an insertion of a GGC twice between 106 and 107 of the reference sequence. There are many other examples in the MDECODE pamphlet. Debbie and Scott's old files would be good references, as some of this notation is confusing and counterintuitive.

GENOTYPING: Each sample is reported in the same way as above, but exception codes are listed in place of alleles.
EX: RM62,lp00106,E001,E002
    RM25,lp00106,E002,E002
    etc.

FORMAT: Each exception is listed and the file is saved as "lp022200.exceptions". See Scott's old files for help.

SUBMITTING:

1. ftp to supera.hg.med.umich.edu
2. Log in as dan (Deborah Ann Nickerson)
3. The password expires every 6 months
4. cd to the gene name, making a new directory if you need to
5. There should be two subdirectories: reference and sequence
6. The sequence directory should contain two subdirectories: phased and unphased.
7. Put the reference sequences (chimp and human) in the reference directory.
8. OLA data, sequence data, and the exceptions file should all be put in the unphased subdirectory of the sequence directory.
9. The following people need to be notified when data is submitted. Check in with Debbie before you send them an email to tell them it is there:
   Andy Clark: c92@PSU.EDU
   Charlie Sing: csing@umich.edu
   Ken Weiss: kgweiss@umich.edu
   Ken Weiss: KMW4@PSU.EDU
   Malia Fullerton: smf15@psu.edu
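
SCRIPT SKETCH:

The source-file rules described earlier (ID mapping, gene code in front of the site, alphabetical alleles, XX changed to NN, and the file naming) could be scripted roughly as follows. This is only a sketch, not an existing lab tool: the function names and the id_map dictionary are made up for illustration, the five-digit zero padding of the site number is inferred from the examples (ae00567, lp00106), and the date string is passed in exactly as it should appear in the file name. Check the output against Scott's old files before relying on it.

```python
# Sketch only: helper names are hypothetical, not part of any MDECODE tooling.

def format_source_line(sample_id, gene_code, position, allele1, allele2, id_map=None):
    """Build one source-file record: sample,site,allele1,allele2."""
    # Sequence data: map "our" identifier back to the MDECODE ID.
    # OLA data already uses Kottke IDs, so no id_map is passed.
    if id_map is not None:
        sample_id = id_map[sample_id]

    # Conflicts (XX genotypes) must be changed to NN before submission.
    if (allele1, allele2) == ("X", "X"):
        allele1, allele2 = "N", "N"

    # Alleles (or exception codes) are listed in alphabetical order.
    first, second = sorted((allele1, allele2))

    # ASSUMPTION: sites are zero-padded to five digits, as in "ae00567".
    site = f"{gene_code}{position:05d}"
    return f"{sample_id},{site},{first},{second}"


def source_filename(gene_code, date, population=None):
    """Build a name like lp022200.source, or lp022200.NK.source for OLA."""
    parts = [f"{gene_code}{date}"]
    if population:
        parts.append(population)
    parts.append("source")
    return ".".join(parts)


if __name__ == "__main__":
    id_map = {"R06": "RM62"}  # hypothetical mapping: our ID -> MDECODE ID
    print(format_source_line("R06", "ae", 567, "T", "A", id_map))  # RM62,ae00567,A,T
    print(format_source_line(1262, "ae", 567, "X", "X"))           # 1262,ae00567,N,N
    print(format_source_line("RM62", "lp", 106, "E002", "E001"))   # RM62,lp00106,E001,E002
    print(source_filename("lp", "022200"))                         # lp022200.source
    print(source_filename("lp", "022200", "NK"))                   # lp022200.NK.source
```

Only a genotype with both alleles recorded as X is rewritten to NN here; how to treat any other conflict notation is not covered above, so ask Debbie rather than extending the script by guesswork.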