REPEATMASKER

usage: RepeatMasker [-options]

species options
-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-cow same as -mam (historical)
-ar(abidopsis) masks repeats found in Arabidopsis thaliana
-dr(osophila) masks repeats found in Drosophilas
-lib [filename] allows usage of a custom library (e.g. from another species)

repeat options
-nolow does not mask low_complexity DNA or simple repeats
-l(ow) same as -nolow (historical)
-(no)int only masks low complex/simple repeats (no interspersed repeats)
-alu only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number] masks only those repeats that are less than [number] percent
diverged from the consensus sequence
-cutoff [number] sets cutoff score for masking repeats when using -lib
(default cutoff 225)

running options
-pa(rallel) [number] number of processors to use in parallel (only works for batch files or
sequences larger than 50 kb)
-q quick search; 5-10% less sensitive, 3-4 times faster than default
-qq rush job; about 10% less sensitive,
-s slow search; 0-5% more sensitive, 2.5 times slower than default
-gc [number] use matrices calculated for 'number' percentage background GC level
-gccalc program calculates the GC content even for batch files/small seqs
-frag [number] maximum sequence length masked without fragmenting (default 51000)
-nocut skips the steps in which repeats are excised
-noisy prints cross_match progress report to screen (defaults to .stderr file)

output options
-a shows the alignments in a .align output file; -ali(gnments) also works
-id adds a column to the .out file table with a unique number for each
inserted element (fragments of the same element generally have the same id)
-inv alignments are presented in the orientation of the repeat (with option -a)
-cut saves a sequence (in file.cut) from which full-length repeats are excised
-small returns complete .masked sequence in lower case
-xsmall returns repetitive regions in lowercase (rest capitals) rather than masked
-x returns repetitive regions masked with Xs rather than Ns
-poly reports simple repeats that may be polymorphic (in file.poly)
-ace creates an additional output file in ACeDB format
-gff creates a General Feature Finding format output
-u creates an untouched annotation file besides the manipulated file
-xm creates an additional output file in cross_match format (for parsing)
-fixed creates an (old style) annotation file with fixed width columns


RepeatMasker is a program that screens DNA sequences for low
complexity DNA sequences and interspersed repeats. The output of the
program is a detailed annotation of the repeats that are present in
the query sequence as well as a modified version of the query sequence
in which all the annotated repeats have been masked (default: replaced
by Ns). On average, about 50% of a human genomic DNA sequence is
masked by the program. Sequence comparisons in RepeatMasker are
performed by the program cross_match, an efficient implementation of
the Smith-Waterman-Gotoh algorithm developed by Phil Green.


Input format:

Sequences have to be in the fasta format:

>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA

The program accepts most common erroneous 'almost fasta format' and
raw sequence files, but does not yet work with other formats (GenBank,
Staden, etc.).
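
As an illustration of the input format described above, the following
Python sketch reads fasta-formatted text into (name, info, sequence)
records. The function name read_fasta is hypothetical and not part of
RepeatMasker, and the sketch does not reproduce the program's tolerance
for 'almost fasta' or raw sequence files.

```python
def read_fasta(text):
    """Parse fasta-formatted text into a list of (name, info, sequence) tuples.

    Hypothetical helper for illustration only; not part of RepeatMasker.
    Lines that appear before the first '>' header are ignored.
    """
    records = []
    name, info, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, info, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            info = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, info, "".join(chunks)))
    return records


example = """>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA
"""
records = read_fasta(example)
```

Everything after the first whitespace on the header line is kept as
free-form information, matching the example headers above.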

You can use RepeatMasker on a file containing multiple fasta format
sequences and on multiple sequence files at the same time:

RepeatMasker *.fasta

This command will mask all files that end with .fasta in the current
directory and give separate reports for each file. Note that if you
have multiple small sequences it is considerably faster to run
RepeatMasker on one batch file than on many single sequence files. The
summary file will be more informative as well. However, analysis on
single files (larger than 2 kb each) can be slightly more accurate,
since GC levels for each sequence will be calculated and used to
choose appropriate parameters.


Output:

RepeatMasker returns a .masked file containing the query sequence(s)
with all identified repeats and low complexity sequences masked. These
masked sequences are listed and annotated in the .out file. The masked
sequences are returned in the same order as they are in the submitted
file, whereas the sequences are presented alphabetically in the
annotation table. The .tbl file is a summary of the repeat content of
the analyzed sequence.


Species options

The default settings of RepeatMasker are for masking a primate (human)
sequence.
Interspersed repeats are specific to a (group of) species, dependent
on the time of activity of the source transposable element. Less than
half of the repeats identified in human DNA are specific to primates,
i.e. they amplified after the eutherian radiation some 100 million
years ago. Most repeats that can be identified in mouse DNA are
specific to rodents, due to higher activity and faster mutation rates
in the rodent lineage. RepeatMasker has separate protocols optimized
for analysis of rodent and primate genomes. Interspersed repeats in
other mammals have not been so well catalogued as yet. Among these,
the program performs best for artiodactyls, for which most repeat data
are available (explaining -cow).

The numbers of different repeat consensus sequences against which queries of
different species are compared gives an impression of how far the different
libraries are developed:

option               species               # of consensi   total bp
[default]            primates              586             729736
-m/-mus/-rod(ent)    rodents               496             558180
-mam(mal)/-cow       other mammals         401             430152
-lib xenopus.lib     Xenopus               25              21678
-lib vertebrates.lib 'other vertebrates'   43              57514
-dr(osophila)        Drosophila            115             338063
-lib elegans.lib     C. elegans            94              111573
-ar(abidopsis)       Arabidopsis           210             703916
-lib grasses.lib     maize, rice           30              88780

Note that the majority of sequences against which rodent and other
mammalian queries are compared are repeats that have been identified
in the human genome but are thought to predate the mammalian radiation.

Six other libraries were extracted from the 'RepBase Update' fasta
libraries with very limited curation. The C. elegans, Arabidopsis,
and Drosophila libraries are getting to be respectable, thanks to the
efforts of the G.I.R.I. people. The Xenopus, vertebrate (mostly chicken,
salmon, zebrafish, puffer) and grasses (maize and rice) libraries are
still fetal. No overview (.tbl) file will be created when the option
-lib is used.



-lib
With the -lib option you can specify a custom library of sequences to
be masked in the query. The library file needs to contain sequences in
fasta format and should be put where the other RepeatMasker libraries
are. I've provided libraries for some vertebrate (vertebrate.lib) and
grasses (grasses.lib) repeats, which are not yet fully integrated and
have to be accessed by using the -lib option.

'RepeatMasker -lib vertebrate.lib bigfrog.seq'
will mask all sequences similar to repeats in the vertebrate database
as well as all low complexity and simple repetitive DNA in
"bigfrog.seq". No '.tbl' file will be produced. Databases of
repetitive sequences in other organisms are distributed by the Genetic
Information Research Institute (GIRI) and are accessible at
(http://www.girinst.org/~server/repbase.html).


-cutoff
When using a local library you may want to change the minimum score
for reporting a match. The default is 225. Lowering it below 180-200
will usually start to give you significant numbers of false matches,
whereas raising it to 250 will guarantee that all matches are real.


Masking options

-nolow / -l(ow)
With the option -nolow or -l(ow) only interspersed repeats are
masked. By default simple tandem repeats and low complexity
(polypurine, AT-rich) regions are masked besides the interspersed
repeats. For database searches the default setting is recommended, but
sometimes, e.g. when using the masked sequence to predict the presence
of exons, it may be better to skip the low complexity masking.

-noint / -int
When using the -noint or -int option only low complexity DNA and
simple repeats will be masked in the query sequence ("minus
interspersed repeats").
Since the 03032000 release, A-rich simple repeats derived from the
poly A tails of SINEs and LINEs are merged with the annotation of the
SINE or LINE (i.e. you can't tell there is a simple repeat). Thus, if
you're interested in finding the location of potentially polymorphic
simple repeats, this option is recommended.

-alu
-div
You can limit the masking and annotation to (primate) Alu repeats with
the -alu option and to a subset of less diverged (younger) repeats
with the option -div. For example,
"RepeatMasker -div 20 -mus mysequence"
will mask only those rodent repeats and simple repeats that are less
than 20% diverged from the consensus sequence and
"RepeatMasker -div 10 -alu mysequence"
will mask Alus that are less than 10% diverged from the Alu consensus
sequences and no other repeats.

The -div option may be used to limit the masking to those repeats that
are either specific to primates or another mammalian order for use in
subsequent comparison of orthologous mammalian loci. On average,
interspersed repeats have diverged 18% in human (~35% in mouse) from
their consensus since the mammalian orders separated. Note that this
method is rather crude, mostly since the range of deterioration of
repeats of the same age is wide; some order specific repeats may
remain unmasked and shared repeats may be masked.


Running options

RepeatMasker can be run at four different sensitivity/speed levels,
with the option -q providing quick (less sensitive) and -s slow
(sensitive) results (see "Sensitivity and Speed" below). The option
-qq has been added for when you're in a frightful hurry.


-frag
RepeatMasker transparently splits sequences over 51 kb into fragments
of 51 kb with 1 kb overlaps. Similarly, sequence batches containing
more than 51 kb are subdivided into batches of less than 51 kb. The
-frag option allows one to change the size of these fragments. The
only visible effect of the fragmentation is in the alignment files,
where alignments at the edges of the fragments can be duplicated
and/or truncated. Fragmentation was implemented to allow the size of
sequences and sequence batches to be unlimited. Note that RepeatMasker
(usually) will not croak if cross_match runs out of memory; it will
redo the failed search with a higher word length (or initially a
smaller 'bandwidth' for MIR comparisons) until it succeeds. Thus,
masking (smaller) fragments can lead to slightly more sensitive
comparisons, being less likely to run out of memory. It also can
improve repeat detection when a genomic sequence contains regions of
DNA with significantly different GC levels (isochores), since sets of
scoring matrices are chosen based on the GC level of a fragment.
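
The fragmentation described above can be pictured with a short Python
sketch that computes fragment coordinates. The function name
fragment_coordinates is hypothetical, and the sketch assumes that
consecutive fragments simply share their last and first 1 kb; the
exact boundaries chosen by RepeatMasker are not documented here.

```python
def fragment_coordinates(length, frag=51000, overlap=1000):
    """Return 0-based, end-exclusive (start, end) pairs covering a sequence.

    Assumption for illustration: each fragment is at most `frag` bases
    long and starts `overlap` bases before the previous fragment ends.
    """
    if frag <= overlap:
        raise ValueError("frag must be larger than overlap")
    if length <= frag:
        # Sequences up to the fragment size are not fragmented.
        return [(0, length)]
    coords = []
    start = 0
    while True:
        end = min(start + frag, length)
        coords.append((start, end))
        if end == length:
            break
        start = end - overlap
    return coords


pieces = fragment_coordinates(120000)
```

Alignments that fall inside the shared 1 kb can show up in both
neighbouring fragments, which is why they may appear duplicated or
truncated in the alignment file.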


-pa(rallel)
For sequences over 50 kb long or files with multiple sequences,
RepeatMasker can use multiple processors. When you type:

RepeatMasker -par 10

A batch file of sequences will run with up to 10 sequences at a
time, until all sequences are done, while a file with one large
sequence will analyze the sequence in up to 10 fragments at the same
time. The minimum fragment size is 25 or 33 kb, the maximum 66 kb (all
sequences over 100 kb are divided in 33-66 kb fragments). For the
batch files no minimum size exists. Thus,

If file contains:        RM runs in parallel:
one 60 kb seq            two 30 kb fragments
one 400 kb seq           ten 40 kb fragments
one 1 Mb seq             ten 50 kb fragments, twice
ten 500 bp sequences     ten 500 bp sequences
two 500 kb sequences     two 500 kb sequences

Processing of the detected matches takes place after all batches or
fragments have been cross matched with the databases.
Beware that, generally, you have a limited number of process IDs
(PIDs) allotted. RepeatMasker uses 4 PIDs for each parallel job, so if
you're allotted 64 user PIDs, you can 'only' run 16 fragments/batches
in parallel.
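
The PID arithmetic above is a simple integer division. A minimal
Python sketch, with hypothetical function names and assuming the 4 PIDs
per parallel job stated in the text:

```python
def max_parallel_jobs(pid_limit, pids_per_job=4):
    """Largest number of parallel jobs that fits within a PID allotment."""
    return pid_limit // pids_per_job


def usable_parallel(requested, pid_limit, pids_per_job=4):
    """Number of parallel jobs to request without exceeding the PID limit."""
    return min(requested, max_parallel_jobs(pid_limit, pids_per_job))


jobs = max_parallel_jobs(64)
```

Checking the limit before choosing a value for -pa avoids asking for
more parallel jobs than the allotment can carry.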


-gc
-gccalc
Neutral mutation patterns differ significantly depending on the GC
richness of a locus and we have calculated optimal scoring matrices
for the alignment to consensus sequences in a range of background GC
levels. Usually, RepeatMasker calculates the percentage of the
sequence consisting of Gs and Cs and uses the appropriate matrices.
However, the program defaults to using 'average' 43% GC matrices when
the query is shorter than 2000 bp or a batch file is analyzed. Short
sequences are less likely to share the GC level of the locus. For
example, CpG islands and exons are more GC rich than the surrounding
DNA, whereas a LINE1 element usually is more AT rich than the
background. In a batch file, RepeatMasker analyses all sequences
together with the same matrices. The percentage GC in all the
sequences combined may be inappropriate for some sequence entries;
using high GC level matrices in AT rich sequences (and vice versa) may
result in false masking.
One can override this behavior in two ways.
With the option -gc you can set the GC level to a certain percentage;
'RepeatMasker -gc 37 mybatchofsequences.fa' lets the program use
matrices appropriate for 37% GC background. The batch could, for
example, contain ESTs from a single locus with a known GC level.
Alternatively, the -gccalc option forces RepeatMasker to use the
actual GC level of a short sequence or the average GC level of a batch
of sequences. The latter sequences, for example, may be contigs in a
sequencing project.
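
The choice of GC level described above can be summarized in a small
Python sketch. The names gc_percent and background_gc are hypothetical.
The sketch assumes that ambiguous bases are ignored and that the
'average' GC level of a batch is the GC level of all sequences
combined; RepeatMasker itself may compute these differently.

```python
DEFAULT_GC = 43.0   # 'average' GC level named in the text
MIN_LENGTH = 2000   # shorter queries get the default matrices


def gc_percent(seq):
    """Percentage of G and C among the A, C, G and T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def background_gc(sequences, gc=None, gccalc=False):
    """Pick the GC level used to select scoring matrices.

    sequences: list of sequence strings from one input file.
    gc:        value given with -gc, or None if the option is absent.
    gccalc:    True when -gccalc is given.
    """
    if gc is not None:
        return float(gc)
    is_batch = len(sequences) > 1
    combined = "".join(sequences)
    if not gccalc and (is_batch or len(combined) < MIN_LENGTH):
        return DEFAULT_GC
    return gc_percent(combined)


level = background_gc(["ATAT", "GGCC"], gc=37)
```

In this sketch -gc takes precedence over -gccalc; the text does not
say what happens when both are given, so that ordering is a guess.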


-noisy
RepeatMasker used to print the voluminous cross_match progress reports
to the screen. In the Dec 1998 version this output is stored in a
.stderr file and a more informative much smaller progress report is
printed to the screen. The option -noisy allows one to see the
cross_match reports coming by on the screen.



Output options

-a / -ali(gnments)
Alignments are saved in a .align file when using the option -a. They
are shown in the orientation of the query sequence, unless you use the
option -inv as well, which will return alignments in the orientation
of the repeats (see 'Alignments' below).

-cut
The option -cut lets the program save a file which contains an
intermediate sequence in the masking progress. In this sequence all
full-length elements (except for LINE1), young LINE1 3' ends, and
close to perfect simple repeats are deleted. This process is currently
only performed for human and rodent DNA. Another option will grow out
of this that returns a sequence in which only primate specific or
rodent specific repeats are deleted, allowing superior alignments of
human and mouse orthologous sequences.

The option -nocut skips the above deletion step in the default
procedure for human and rodent queries. RepeatMasker is generally more
sensitive including the deletion step as it can unearth older repeats
that were interrupted by younger elements.


-x
When -x is used the repeat sequences are replaced by Xs instead of
Ns. Using Xs allows one to distinguish the masked areas from
possibly existing ambiguous bases (Ns) in the original sequence.
However, when running BLAST searches (and maybe other programs) Xs are
deleted out of the query and the returned BLAST matches will have
position numbers not necessarily corresponding to that of the original
sequence.

-xsmall
When the option -xsmall is used a sequence is returned in the .masked
file in which repeat regions are in lower case and non-repetitive
regions are in capitals.
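
The difference between the default, -x and -xsmall output styles can
be shown with a short Python sketch. The function mask and the
interval representation are hypothetical; they only imitate the
appearance of the .masked file and are not how RepeatMasker works
internally.

```python
def mask(seq, repeats, style="N"):
    """Mask 0-based, end-exclusive (start, end) intervals in seq.

    style "N":      replace repeats with Ns (the default behavior)
    style "X":      replace repeats with Xs (as with -x)
    style "xsmall": repeats in lower case, the rest in capitals
    """
    if style not in ("N", "X", "xsmall"):
        raise ValueError("unknown style: " + style)
    out = list(seq.upper()) if style == "xsmall" else list(seq)
    for start, end in repeats:
        # Clamp the interval to the sequence so the length never changes.
        start = max(start, 0)
        end = min(end, len(seq))
        if end <= start:
            continue
        if style == "xsmall":
            out[start:end] = seq[start:end].lower()
        else:
            out[start:end] = style * (end - start)
    return "".join(out)


query = "ACGTNACGTA"          # contains one ambiguous base (N)
masked_n = mask(query, [(6, 9)])
masked_x = mask(query, [(6, 9)], style="X")
masked_small = mask(query, [(6, 9)], style="xsmall")
```

With Ns the original ambiguous base at position 5 cannot be told apart
from the masked stretch next to it, whereas with Xs or lower case it
can.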

-poly
You can get a list of potentially polymorphic microsatellites with the
option -poly. This is simply a subset of the list in .out, with
dimeric to tetrameric repeats less than 10 % diverged from perfection.
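
The selection rule for -poly can be written down as a small filter.
This Python sketch uses a hypothetical list of (unit, percent
divergence) pairs as a stand-in for the simple repeat lines of the .out
file; the real file has more columns.

```python
def potentially_polymorphic(simple_repeats, max_divergence=10.0):
    """Keep simple repeats with a 2-4 bp unit and divergence below the limit.

    simple_repeats: iterable of (unit, percent_divergence) tuples
                    (hypothetical representation for illustration).
    """
    return [
        (unit, div)
        for unit, div in simple_repeats
        if 2 <= len(unit) <= 4 and div < max_divergence
    ]


annotated = [("CA", 3.2), ("A", 0.0), ("GAAA", 9.9),
             ("TTAGGG", 1.0), ("CAG", 12.5)]
candidates = potentially_polymorphic(annotated)
```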


-id
This option adds an extra column to the annotation table (.out),
displaying a unique number (ID) for each integrated element. Thus,
fragments of a single element, separated from each other by subsequent
insertions of other elements, deletions or recombinations, carry the
same number. This feature allows better interpretation of the data and
should greatly help proper graphical display of the repeats. The
column follows all other columns, except for the (relatively rare)
indication that an annotation overlaps another annotation (*).


-fixed
Since April 1999 the column widths in the annotation table are
adjusted to the maximum length of any string occurring in a column;
this allows long sequence names to be spelled out completely.
Previously, a fixed column width table was returned, which can still
be obtained by using the -fixed option. Parsing should not be affected
by this change of default behavior, as the same number of columns with
the same formatted text are still separated by white-space.


-xm
When using the -xm option an additional output file (.out.xm) is
created that contains the same information as the .out file (excluding
the low-complexity/simple DNA), but then in the original cross_match
format. This output is harder to read but there are programs that
require the exact cross_match output format.


-u
The script ProcessRepeats adjusts the original RepeatMasker output so
that the annotation more closely reflects reality. With the option -u
a .ori.out file is created that contains the original (but sorted)
cross_match summary lines.


-ace
With the -ace option an .ace file is created by the script. This is
merely a suggestion. The columns in the table currently are:

Motif_homol RepeatMasker(method)



-gff
A .gff file is created by the script with the annotation in 'General
Feature Finding' format. See http://www.sanger.ac.uk/Software/GFF for
details. The current output follows a Sanger convention:

RepeatMasker Similarity
. Target "Motif:" consensus>

In this line, 'RepeatMasker' becomes 'RepeatMasker_SINE' if the match
is against an Alu. I don't know why.



ProcessRepeats options

When you have already run RepeatMasker and find that you need a
differently formatted annotation file, you do not want the low
complexity and simple repeat matches displayed in the .out file(s), or
you want a list of possibly polymorphic simple repeats, you only need
to rerun ProcessRepeats on the .cat file(s), which will take just a
small fraction of the time required to rerun RepeatMasker. E.g.:

ProcessRepeats -mus -xm -low -id myhumongousmousesequence.cat

The -mus option is necessary, since repeats are processed differently
for rodent and primate queries. Note that the .out file will be
overwritten and, in this case, will not contain information on simple
repeats and low complexity DNA anymore (if you want to keep both,
rename the original outfile). The -id option adds the unique ID
column, that you may have forgotten to include in the RepeatMasker
command line. The .tbl file will not be overwritten (for this the -tbl
option needs to be used).

The options available for ProcessRepeats are:

default settings are for handling a human sequence .cat file
-mus adjusts the processing and .tbl file for rodent repeats
-cow adjusts the .tbl file to artiodactyl repeats (also used for other mammals)
-ar(abidopsis) adjusts .tbl file for Arabidopsis thaliana repeats
-dr(osophila) adjusts .tbl file for Drosophilas repeats
-lib does not produce .tbl file, skips most of processing

-id adds a column to the .out file table with a unique number for each
inserted element (fragments of the same element generally have the same id)
-l(ow) does not print out simple repeats or low_complexity DNA
-u creates an untouched annotation file besides the manipulated file
-xm creates an additional output file in cross_match format (for parsing)
-ace creates an additional output file in ACeDB format
-poly creates an output file listing only potentially polymorphic simple repeats
-fixed creates an (old style) annotation file with fixed width columns
-neg results in sometimes negative coordinates for L1 elements; all L1 subfamilies
are aligned over the ORF2 region, sometimes improving interpretation of data

the following are only necessary if you have accidentally deleted a .tbl or .align file

-a shows the alignments in a .align output file (only works when
RepeatMasker has run with the -a option)
-tbl creates/overwrites a .tbl file (need to provide information below as well)
-gc GC content of query sequence
-length total sequence length in file submitted to RepeatMasker
-masked number of bases masked in query sequence
-sp RepeatMasker parameter specifications
-ver version RepeatMasker
last five options for use in .tbl file; info provided by RepeatMasker script



Sensitivity and speed

The program can be run at four levels of sensitivity. The only
difference between these settings is the minimum match or word length
in the initial (not quite) hashing step of the cross_match program
(see the cross_match/phrap documentation). The "slow" setting will
find and mask 0-5% more repetitive DNA sequences than by default,
whereas the "quick" settings miss 5-10% of the sequences masked by
default. The alignments may extend more or be somewhat more accurate
in the more sensitive settings as well. The -s (slow/sensitive)
setting will take on average 2.5 x as long as the default setting,
whereas the -q (quick) setting is 3 to 6 times faster than the
default.

Because of the continuing growth of the human repeat databases,
RepeatMasker's speed, when using the same settings, has actually
decreased over time (a new trick in cross_match in 1998 doubled the
speed, but it could not compete with the growth of the database). I've
added a -qq (rush job) option that runs with the same speed as the old
-q option, but is slightly less sensitive again. Hopefully, your
computer has become faster since 1996 to keep up with the database
growth. Also, the use of multiple processors eases the pain for large
sequences and any batch files.

On average, with default settings, a 40 kb human cosmid is analyzed in
less than 4 minutes (user time) on a Digital UNIX V4.0D.

Seconds user time on a Digital UNIX V4.0D

                    setting
length     -qq      -q     def      -s
 5 kb        8      14      29      64
10 kb       11      21      57     134
20 kb       16      33     117     290
40 kb       25      55     227     572
80 kb       41      99     448    1145
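
The relative speed of the settings can be read off the table by
dividing the default time by the time for another setting. A small
Python sketch using the timings above (the names TIMINGS and
speed_factor are hypothetical):

```python
# Seconds user time from the table above, keyed by sequence length in kb.
TIMINGS = {
    5:  {"-qq": 8,  "-q": 14, "def": 29,  "-s": 64},
    10: {"-qq": 11, "-q": 21, "def": 57,  "-s": 134},
    20: {"-qq": 16, "-q": 33, "def": 117, "-s": 290},
    40: {"-qq": 25, "-q": 55, "def": 227, "-s": 572},
    80: {"-qq": 41, "-q": 99, "def": 448, "-s": 1145},
}


def speed_factor(length_kb, setting):
    """How many times faster than default a setting is (below 1 = slower)."""
    row = TIMINGS[length_kb]
    return row["def"] / row[setting]


for length_kb in sorted(TIMINGS):
    factors = {s: round(speed_factor(length_kb, s), 2)
               for s in ("-qq", "-q", "-s")}
    print(length_kb, "kb", factors)
```

On these numbers the gain from -q and -qq grows with sequence length,
while -s stays close to 2.5 times slower than default for the longer
sequences.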

The relative speed of sequences of different lengths is dependent on
the computer; our server is 'better' in short sequences than this DEC.
The speed is also dependent on the repeat content of the sequence. For
example, LINE rich sequences are analyzed somewhat more slowly and Alu
rich sequences somewhat faster.

If you have several shorter sequences it is much faster to run
RepeatMasker on a batch file (all sequences in one file). On above
computer, in the rush mode, a batch of ten 5 kb sequences is analyzed
in 23 seconds, twenty 5 kb sequences in 34 seconds, etc.

The user time for sequences or sequence batches over 100 kb (or
whatever the fragment size is set to) is linearly related to the
length of the query due to the fragmentation of the query sequence.

The increase in speed by using multiple processors is dependent on
the usage of the computer and above non-linear relationships of
sequence length and processing time. However, under the right
circumstances, using 2 processors can increase the speed close to
twofold, because all time consuming processes are performed in
parallel.


Scoring matrices

We have calculated statistically optimal scoring matrices for the
alignment of neutrally diverging (non-selected) sequences in human DNA
to their original sequence. These matrices have been in use since the
May 1998 release. The matrices were derived from alignments of DNA
transposon fossils to their consensus sequences (Arian Smit, Arnie Kas
& Phil Green, in preparation...). A series of different matrices are
used dependent on the divergence level (14-25%) of the repeats and the
background GC level (35-53%, neutral mutation patterns differ
significantly in different isochores).

These matrices are (close to) optimal for human genomic sequences
longer than 10 kb, for which length the GC level usually is
representative of the isochore in which the sequence lives. However,
the GC level of small fragments can diverge a lot from the surrounding
(e.g. a fragment spanning a CpG island, a GC rich exon or an AT-rich
LINE1 element) and RepeatMasker defaults to using matrices derived for
a 43% GC background when a sequence is shorter than 2000 bp or when a
batch file is submitted. When the appropriate background GC level is
known, this can be entered with the -gc option.

(Note that these matrices are an integral portion of RepeatMasker and
are covered under the same restrictions as the scripts and databases
as described in the signed software agreement).


Selectivity and matches to coding sequences

The cutoff Smith-Waterman scores for masking interspersed repeats are
conservative, since masking of one short potentially interesting
region generally is more harmful than not masking a number of hard to
find matches. If there are any false matches, they tend to have
scores close to the cutoff, which is 225 for most repeats, 300 for the
low-complexity LINE1 search*, and 180 for the very old MIR, LINE2 and
MER5 sequences.
* most LINE1s are detected with a 225 cut-off, but in one step in
RepeatMasker the low-complexity score adjustment is turned off to find
ancient A-rich L1 elements.

We tested for the occurrence of false matches in randomized and in
inverted (but not complemented) DNA. To check a variety of conditions,
four 150 to 400 kb DNA fragments were analyzed ranging in GC level
from 36% to 54%. To retain seeds for Smith Waterman alignments,
randomization was done at the 10 bp word level. Note that the inverted
sequences retain the low complexity and simple repeat patterns of the
original sequences. Even at sensitive settings, for which false
matches are most likely, this version of RepeatMasker reported no
(false) matches at all to interspersed repeats in the randomized or
inverted sequences. No simple repeats were reported in the randomized
queries.
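
The two control sequences used in this test can be generated along the
following lines. This Python sketch is a reconstruction under stated
assumptions: the text does not say how the 10 bp words were delimited,
so consecutive, non-overlapping words are assumed, and the function
names are hypothetical.

```python
import random


def invert(seq):
    """Reverse the sequence without complementing it."""
    return seq[::-1]


def shuffle_words(seq, word=10, seed=None):
    """Shuffle consecutive, non-overlapping words of `word` bases.

    Assumption: words start at position 0; a shorter final word is
    shuffled along with the others.
    """
    words = [seq[i:i + word] for i in range(0, len(seq), word)]
    rng = random.Random(seed)
    rng.shuffle(words)
    return "".join(words)


original = "AAAAAAAAAA" + "CCCCCCCCCC" + "GGGGGGGGGG" + "ACGTACGTAC"
inverted = invert(original)
randomized = shuffle_words(original, word=10, seed=1)
```

Inversion keeps runs such as poly-A stretches intact, which is why the
inverted sequences retain low complexity and simple repeat patterns,
while shuffling at the word level breaks up anything longer than a
word.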

RepeatMasker returned only a single probably false match (71 bp) when
analyzing a batch of 4440 coding regions in human mRNAs (7,200,000 bp)
at sensitive settings. The coding regions were collected from GenBank,
based on annotations, filtered for the presence of complete ORFs and
initiator methionines, and made more or less non-redundant. When each
coding region was analyzed individually using the -gccalc option, 5
matches (414 bp, 0.006%) were falsely masked (156 bp at default speed,
76 bp at quick settings). In this analysis each sequence was analyzed
with matrices chosen based on the actual GC level, even for very short
sequences, while in the batch analysis of the coding regions the
'average' 43% GC matrices were used.

The 1998 and later versions of RepeatMasker show somewhat more false
masking when a pre-1998 version of cross_match is used. These are
primarily the result of improper assumptions of the background
nucleotide frequency used in the scoring matrix calculation when
adjusting for the complexity of a match. Specifically, a very GC rich
region in an AT-rich isochore (like an exon) may improperly match a GC
rich repeat, since the scores for C/G matches are higher in the
scoring matrix used (calculated for this AT rich background) than
those for A/T matches, whereas the old cross_match assumed a 50% GC
background in these calculations and equal scores for A/T and G/C
matches. The new version of cross_match reads the correct nucleotide
background level from the matrix used.


Use in database searches

RepeatMasker is most commonly used to avoid spurious matches in
database searches. Generally this step is strongly recommended before
doing BLASTN or BLASTX equivalent searches with mammalian DNA
sequence.

The most common concern is of course if RepeatMasker ever masks coding
regions.
We found that false matches in coding regions are extremely rare, but
did identify 38 genuine fragments of interspersed repeats (4214 bp) in
the (annotated) coding regions of the 4440 human mRNAs (7.2 Mb)
analyzed (excluding annotated coding sequences of LINE1 elements and
endogenous retroviruses). We verified matches with lower scores by
comparing the translation products to close homologous or redundant
entries in the database (in those entries exactly the repeat matching
regions were always missing). In the majority of these cases, the
sequences appear
to be improperly annotated or to represent either artificially or
naturally defective mRNAs (e.g. alternatively spliced exons comprised
of a small fragment of a repeat). Genuine overlaps of interspersed
repeats with coding sequences usually involve terminal regions of the
ORFs. Since the transposable element derived region is unique to the
protein in that (group of) species, the masking does not interfere
with database searches.

However, some cautionary comments are necessary. First, a few active
cellular genes are derived from transposable elements (see my 1999
review for a list of 19 in our genome). Some of these genes will be
partially masked by a (related) transposon in the repeat database. EST
and cDNA matches beyond the masked region should alert you.

Also be aware that RepeatMasker screens for small RNA pseudogenes and
will therefore mask the active small RNA genes as well (I think the
tRNA list is complete, I stopped adding snRNAs unless I found an
indication that they have created many pseudogenes). The number of
matches to small RNAs are listed in the overview table; (close to)
exact matches are possibly active genes, although related active genes
not in the database may show diverged matches.

A final caution relates to the fact that 3' UTRs of transcripts are
about as dense in interspersed repeats as intergenic regions
are. Thus, many ESTs are completely masked as repetitive DNA. I
recommend that, when you compare a genomic sequence against the EST
database or use ESTs as a query in nucleotide searches, you search
with the unmasked sequence as well; use a long minimum match (word
length/ word size) like 40 bp to identify exact matches and avoid most
background. Unfortunately the maximum word length that can be used in
the NCBI BLASTN program is 18 (due to memory limitations).


Use in association with gene prediction programs

Predicting genes from a masked sequence has several problems. First,
one should use the option -low to avoid masking low complexity regions
and trinucleotide repeats in coding regions. But even with only
interspersed repeats masked, gene prediction programs may fail to
identify exons correctly. As pointed out above, sometimes tail ends of
coding regions may have originated from transposable elements. Even if
no coding regions have been masked, splice sites may be compromised;
e.g. the polypyrimidine region that contributes to an acceptor splice
site may be contained within a repeat.

Thus, I generally recommend running a gene prediction program on
unmasked DNA (as well) and comparing the predicted genes and exons with
the RepeatMasker output. Some gene prediction programs allow you to
force certain exons out of the predictions (e.g. often the old ORFs of
LINE1 elements and endogenous retroviruses are included in
genes). Work is also in progress at several sites to incorporate
RepeatMasker into gene prediction programs, in which cases matches to
repeats are weighted in along with the other parameters used.


Other uses

Many people mask repeats before designing primers or oligo probes from
sequence data. I've been told often that primers/probes designed from
regions unmasked by RepeatMasker have a much better success rate. A
cautionary note here is that unmasked regions are not necessarily
unique in the genome (e.g. many lower copy repeats are not in the
database yet) and experiments should be performed as if no filtering
against repeats has been done. The alignments can help in designing
primers from sequences that are completely masked. Regions that
diverge much from the consensus are less likely to misbehave than
others.

RepeatMasker is sometimes used during assembly of large genomic
sequences. This procedure probably is most useful in very Alu rich
regions; in that situation I recommend masking only the Alus, and
maybe limiting the masking to those Alus less than 15% diverged (-div
15).

There are plenty of other uses, e.g. analysis of repeats can reveal a
lot about the evolution of a locus (deletions vs insertions,
inversions, approximate time of these events). When you're doing that
you're a specialist and don't need any help from this help file (maybe
from some of the literature cited below though).


Low-complexity DNA

By default, along with the interspersed repeats, RepeatMasker masks
low-complexity DNA. Simple repeats (micro-satellites) can originate at
any site in the genome, and therefore have an interspersed
character. Other low-complexity DNA, primarily poly-purine/
poly-pyrimidine stretches, or regions of extremely high AT or GC
content will result in spurious matches in some database searches as
well (especially in the ungapped BLASTN searches). For example,
extremely AT-rich regions consistently will give very low probability
matches to mitochondrial DNA in BLASTN searches. The settings are very
stringent, and we think that few if any sequences informative in
database searches are masked as low-complexity DNA. However, you can
skip the low-complexity DNA masking using the option -nolow or -l(ow).

Under the current settings a 100 bp stretch of DNA is masked when it
is >87% AT or >89% GC; a 30 bp stretch has to contain 29 A/T (or G/C)
nucleotides. The settings are slightly more stringent than the
original settings, partly because the gapped BLAST programs are less
sensitive to short regions of low complexity than the old gapless
BLAST. In coding regions I have not yet found extensive regions (>10
bp) masked as low complexity DNA that would not be masked by the
combined XNU and SEG filters routinely used in BLASTX.
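
To make the thresholds above concrete, here is a small Python sketch
that tests a single window against them. It only illustrates the numbers
quoted in the text; it is not how RepeatMasker detects low-complexity
DNA, and the function names are made up for this example.

```python
def at_gc_counts(window):
    """Return (number of A/T bases, number of G/C bases) in a DNA string."""
    w = window.upper()
    at = w.count("A") + w.count("T")
    gc = w.count("G") + w.count("C")
    return at, gc


def meets_100bp_rule(window):
    """True if the window (intended to be 100 bp) is >87% AT or >89% GC.
    Integer arithmetic avoids floating-point edge cases at the threshold."""
    at, gc = at_gc_counts(window)
    n = len(window)
    return at * 100 > 87 * n or gc * 100 > 89 * n


def meets_30bp_rule(window):
    """True if a 30 bp window contains at least 29 A/T or 29 G/C bases."""
    at, gc = at_gc_counts(window)
    return at >= 29 or gc >= 29


print(meets_100bp_rule("A" * 88 + "G" * 12))  # 88% AT
print(meets_100bp_rule("A" * 87 + "G" * 13))  # 87% AT, not above threshold
print(meets_30bp_rule("AT" * 14 + "AG"))      # 29 of 30 bases are A/T
```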


Annotation of simple repeats

Although RepeatMasker does a good job in masking simple repeats to
avoid spurious matches in database searches, it is not written to find
and indicate all possibly polymorphic simple repeat sequences. Only
di- to pentameric and some hexameric repeats are scanned for and
simple repeats shorter than 20 bp are ignored. The -poly option prints
out a separate list of simple repeats of < 10% divergence from a
perfect repeat. However, even long perfect repeats may not be
presented in this list; e.g. two perfect 40 bp long (CA)n repeats
interrupted by 10 Ts are aligned in one piece and may be reported as
having > 10% divergence from the consensus. Perfect hexameric
etc. repeats are sometimes listed as quite diverged smaller unit
repeats and won't appear in the .polyout file.

Also note that, in the default output, simple repeats expanded from
the poly A tails of ALUs and LINE1 are now included in the Alu or
LINE1 annotation. This cleans up the annotation a bit and lets the
stand-alone poly A regions stand out (they may indicate the presence of
a processed pseudogene). However, even perfect simple repeats in such
tails will be hidden in the .out file.

A program optimized to quickly find all dimeric to pentameric repeats
is sputnik, available at ftp://ftp.nhgri.nih.gov/pub/software/sputnik/
or http://www.abajian.com/sputnik/. Some web sites dedicated to
identifying polymorphic tandem repeats are at http://pompous.swmed.edu
and http://c3.biomath.mssm.edu/trf.html


HOW TO READ THE RESULTS

The annotation (.out) file

The annotation file contains the cross_match summary lines. It lists
all best matches (above a set minimum score) between the query
sequence and any of the sequences in the repeat database or with low
complexity DNA. The term "best matches" reflects that a match is not
shown if its domain is over 80% contained within the domain of a
higher scoring match, where the "domain" of a match is the region in
the query sequence that is defined by the alignment start and
stop. These domains have been masked in the returned masked sequence
file. In the output, matches are ordered by query name, and for each
query by position of the start of the alignment.
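
The "best matches" rule described above can be sketched in a few lines
of Python. This is an illustration of the 80% containment criterion
only, under the assumption that each match is reduced to a (score,
start, end) triple; the actual implementation is not shown in this help
file and may differ in detail. The scores and coordinates are
hypothetical.

```python
def contained_fraction(inner, outer):
    """Fraction of the domain `inner` (start, end; 1-based, inclusive)
    that lies within the domain `outer`."""
    overlap = min(inner[1], outer[1]) - max(inner[0], outer[0]) + 1
    return max(overlap, 0) / (inner[1] - inner[0] + 1)


def best_matches(matches, max_contained=0.80):
    """Drop every match whose domain is more than `max_contained`
    contained in the domain of a higher-scoring match.
    Each match is a (score, start, end) tuple."""
    shown = []
    for m in matches:
        higher = [h for h in matches if h[0] > m[0]]
        if all(contained_fraction(m[1:], h[1:]) <= max_contained
               for h in higher):
            shown.append(m)
    # Report in order of the start position, as in the .out file.
    return sorted(shown, key=lambda m: m[1])


# Hypothetical matches: (score, start, end) in query coordinates.
matches = [(1000, 100, 500), (300, 150, 400), (250, 450, 700)]
print(best_matches(matches))
```

Here the second match lies entirely inside the first and is dropped,
while the third overlaps the first by only about 20% of its own length
and is kept.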

Example:

1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103
12204 10.0 2.4 1.8 HSU08988 6782 7714 (21529) C TIGGER1 DNA/MER2_type (0) 2418 1493
279 3.0 0.0 0.0 HSU08988 7719 7751 (21492) + (TTTTA)n Simple_repeat 1 33 (0)
1765 13.4 6.5 1.8 HSU08988 7752 8022 (21221) C AluSx SINE/Alu (23) 289 1
12204 10.0 2.4 1.8 HSU08988 8023 8694 (20549) C TIGGER1 DNA/MER2_type (925) 1493 827
1984 11.1 0.3 0.7 HSU08988 8695 9000 (20243) C AluSg SINE/Alu (5) 305 1
12204 10.0 2.4 1.8 HSU08988 9001 9695 (19548) C TIGGER1 DNA/MER2_type (1591) 827 2
711 21.2 1.4 0.0 HSU08988 9696 9816 (19427) C MER7A DNA/MER2_type (224) 122 2

This is a sequence in which a Tigger1 DNA transposon has integrated
into a MER7 DNA transposon copy. Subsequently two Alus integrated in
the Tigger1 sequence. The simple repeat is derived from the poly A of
the AluSx element. The first line is interpreted as such:

1306 = Smith-Waterman score of the match, usually complexity adjusted
The SW scores are not always directly comparable. Sometimes
the complexity adjustment has been turned off, and a variety of
scoring-matrices are used.

15.6 = % substitutions in matching region compared to the consensus
6.2 = % of bases opposite a gap in the query sequence (deleted bp)
0.0 = % of bases opposite a gap in the repeat consensus (inserted bp)
HSU08988 = name of query sequence
6563 = starting position of match in query sequence
6781 = ending position of match in query sequence
(22462) = no. of bases in query sequence past the ending position of match
C = match is with the Complement of the consensus sequence in the database
MER7A = name of the matching interspersed repeat
DNA/MER2_type = the class of the repeat, in this case a DNA transposon
fossil of the MER2 group (see below for list and references)
(0) = no. of bases in (complement of) the repeat consensus sequence
prior to beginning of the match (so 0 means that the match extended
all the way to the end of the repeat consensus sequence)
336 = starting position of match in database sequence (using top-strand numbering)
103 = ending position of match in database sequence

An asterisk (*) in the final column (no example shown) indicates
that there is a higher-scoring match whose domain partly
(<80%) includes the domain of this match.
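
For those who parse the annotation file themselves, the column layout
described above can be read with a short Python function. This is a
minimal sketch, assuming whitespace-separated columns, no spaces in
query or repeat names, and no header lines; the dictionary keys are my
own choice and not part of any RepeatMasker interface. The three repeat
position columns are kept as raw strings because their order depends on
the orientation: in the examples, "C" matches list the remaining bases
(in parentheses) first, and "+" matches list them last.

```python
def parse_out_line(line):
    """Split one annotation line into named fields."""
    fields = line.split()
    # A trailing asterisk flags partial overlap with a higher-scoring match.
    overlapped = fields[-1] == "*"
    if overlapped:
        fields.pop()
    return {
        "sw_score": int(fields[0]),
        "pct_subst": float(fields[1]),
        "pct_del": float(fields[2]),
        "pct_ins": float(fields[3]),
        "query": fields[4],
        "q_start": int(fields[5]),
        "q_end": int(fields[6]),
        "q_left": int(fields[7].strip("()")),
        "strand": fields[8],
        "repeat": fields[9],
        "repeat_class": fields[10],
        "repeat_pos": tuple(fields[11:14]),
        # Present only when the -id option was used.
        "id": fields[14] if len(fields) > 14 else None,
        "overlapped": overlapped,
    }


first = parse_out_line(
    "1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A "
    "DNA/MER2_type (0) 336 103"
)
print(first["repeat"], first["q_start"], first["q_end"], first["repeat_pos"])
```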

Note that the SW score and divergence numbers for the three Tigger1
lines are identical. This is because the information is derived from a
single alignment (the Alus and simple repeat were deleted from the
query before the alignment with the Tigger element was performed). The
ProcessRepeats script makes educated guesses about many fragments if
they are derived from the same element (e.g. it knows that the MER7A
fragments represent one insert).

When you use the -id option, an extra column appears that gives each
element a unique ID. Here is another example that shows how much
trouble ProcessRepeats goes to in defragmenting elements and how the ID
can be useful in interpreting the results:

7120 19.9 0.6 0.3 NT_001227 85631 87837 (19816) + L1PA16 LINE/L1 -707 1176 (4970) 123
2503 14.9 6.5 0.7 NT_001227 87839 88241 (19412) + MSTA LTR/MaLR 1 426 (0) 100
867 12.9 2.7 0.0 NT_001227 88242 88388 (19265) + MSTA-int LTR/MaLR 1 151 (1500) 100 *
5219 19.5 2.9 0.6 NT_001227 88386 89342 (18311) + MSTA-int LTR/MaLR 629 1607 (44) 100
8003 3.5 0.8 0.0 NT_001227 89362 90773 (16880) C L1PA3 LINE/L1 (0) 6155 4745 103
7677 3.5 0.0 0.0 NT_001227 90795 94059 (13594) C L1PA3 LINE/L1 (0) 6155 2872 104
9050 6.5 0.4 0.1 NT_001227 94060 95127 (12526) C MER11C LTR/ERVK (0) 1071 1 106
7677 3.5 0.0 0.0 NT_001227 95128 97101 (10552) C L1PA3 LINE/L1 (3282) 2873 900 104
5619 7.8 0.3 0.9 NT_001227 97097 97865 (9788) C L1PA3 LINE/L1 (5370) 776 13 104 *
320 16.9 0.0 1.7 NT_001227 97876 97934 (9719) + MSTA-int LTR/MaLR 1594 1651 (0) 100
1475 19.0 4.8 5.6 NT_001227 97935 98255 (9398) + MSTA LTR/MaLR 1 323 (48) 100
2322 14.4 0.8 1.6 NT_001227 98256 98629 (9024) + THE1C LTR/MaLR 1 371 (0) 112
10051 12.9 3.5 4.3 NT_001227 98630 100221 (7432) + THE1C-int LTR/MaLR 1 1580 (0) 112
2359 15.7 0.3 1.9 NT_001227 100224 100598 (7055) + THE1C LTR/MaLR 3 371 (0) 112
1475 19.0 4.8 5.6 NT_001227 100599 100646 (7007) + MSTA LTR/MaLR 323 371 (0) 100
1360 19.4 8.2 1.7 NT_001227 100662 100955 (6698) + MSTA LTR/MaLR 114 426 (0) 113
11892 24.7 1.9 2.0 NT_001227 100968 101243 (6410) + L1PA16 LINE/L1 1185 1460 (4686) 123
2062 11.9 8.4 0.0 NT_001227 101244 101563 (6090) C L1PA12 LINE/L1 (10) 6164 5818 116
11892 24.7 1.9 2.0 NT_001227 101564 105425 (2228) + L1PA16 LINE/L1 1460 5286 (860) 123
257 0.0 0.0 2.9 NT_001227 105436 105469 (2184) + (TAA)n Simple 2 34 (0) 118
2189 18.2 0.2 0.7 NT_001227 105470 105893 (1760) + L1PA16 LINE/L1 5359 5780 (394) 123
255 6.1 0.0 0.0 NT_001227 105896 105928 (1725) + (TA)n Simple 1 33 (0) 120 *
369 0.0 0.0 0.0 NT_001227 105928 105968 (1685) + (GA)n Simple 2 42 (0) 121
305 18.8 0.0 1.0 NT_001227 105971 106066 (1587) + (TA)n Simple 2 96 (0) 122
1589 21.2 1.6 1.1 NT_001227 106068 106449 (1204) + L1PA16 LINE/L1 5782 6165 (1) 123

This entire 20,819 bp block of sequence consists of an L1PA16
(#123), in which 7 or 8 elements have integrated (it is unclear to me
if the MSTA #113 is a separate integration or local tandem
duplication). There are at least four layers with MER11 (#106)
inserted in L1PA3 (#104) inserted in MSTA (#100, maybe in #113)
inserted in L1PA16. L1PA16 is already primate specific, so that all
these insertions took place in primate evolution.

I think the last column helps much in deciphering the events. It also
should be a basis for the graphic display of repeatmasker output.
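
As an illustration of how the ID column can be used, the Python sketch
below groups some fragments from the example above by ID and reports
the span of each element in the query. The coordinates are copied from
the example; the function names are hypothetical, and the nesting test
is only a simple check on spans, not what ProcessRepeats does
internally.

```python
from collections import defaultdict

# (ID, query start, query end, repeat name), copied from the example above.
fragments = [
    (123, 85631, 87837, "L1PA16"),
    (104, 90795, 94059, "L1PA3"),
    (106, 94060, 95127, "MER11C"),
    (104, 95128, 97101, "L1PA3"),
    (104, 97097, 97865, "L1PA3"),
    (123, 100968, 101243, "L1PA16"),
    (123, 101564, 105425, "L1PA16"),
    (123, 105470, 105893, "L1PA16"),
    (123, 106068, 106449, "L1PA16"),
]


def spans_by_id(frags):
    """Return {ID: (first start, last end, number of fragments)}."""
    by_id = defaultdict(list)
    for elem_id, start, end, _name in frags:
        by_id[elem_id].append((start, end))
    return {
        elem_id: (min(s for s, _ in parts), max(e for _, e in parts), len(parts))
        for elem_id, parts in by_id.items()
    }


def nested_in(inner, outer):
    """True if the span `inner` lies entirely within the span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


spans = spans_by_id(fragments)
start, end, count = spans[123]
print("L1PA16 (#123): %d fragments spanning %d bp" % (count, end - start + 1))
print("#106 inside #104:", nested_in(spans[106], spans[104]))
print("#104 inside #123:", nested_in(spans[104], spans[123]))
```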


Alignments

When using the -a option, a .align file is created that contains the
alignments of your query sequence to the matching repeat consensus
sequences. The alignments are given in the same order as listed in the
.out file.

These alignments may be most generally useful for people designing PCR
primers in a region full of repeats. It is possible to get primers
that work in a whole genome when the 3' end of the primer lies in a
region of a repeat (even a common one) that is very different from the
consensus.

Here is an example of an alignment of a MIR spanning an Alu element
deleted in an earlier step:

665 28.45 2.93 5.02 g5129s420 7350 7882 (1924) C MIR#SINE/MIR (1) 261 28 3

g5129s420 7350 ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT 7397
v v i i i v viv v i v v v
C MIR#SINE/MIR 261 ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA 212

g5129s420 7398 AGGACTCTGAACTATAT---CTTACTT-GTCTTCATTAAAAACCTTATGA 7443
vi i iv i i i i i i v i
C MIR#SINE/MIR 211 AGCGCTTTACA-TGTATTAACTCATTTAATCCTCA-CAACAACCCTATGA 164

g5129s420 7444 AAAAGGTACTATTATTAACTGGGGXTGGGTTGTTTAACAGATAAGAAAGC 7787
iiv v i iii v i i i
C MIR#SINE/MIR 163 GGTAGGTACTATTATTATCC---------CCATTTTACAGATGAGGAAAC 123

g5129s420 7788 TTAAGAATTAGAGAGATAAATTATCTTGCTTAAGGTAACACAGTTAACAA 7837
v i v i i v v v ii v i ii
C MIR#SINE/MIR 122 TGAGGCA-CAGAGAGGTTAAGTAACTTGCCCAAGGTCACACAGCTAGTAA 74

g5129s420 7838 GCATTAG-GTCAAAGTTTGAACTCGGGCAGTCTGACTACAGAGCCC 7882
iivi i iiii i i i i v i
C MIR#SINE/MIR 73 GTGGCAGAGCCGGGATTCGAACCCAGGCAGTCTGGCTCCAGAGTCC 28

Transitions / transversions = 1.96 (45 / 23)
Gap_init rate = 0.03 (8 / 234), avg. gap size = 2.38 (19 / 8)


In cross_match alignments mismatches caused by transitions are
indicated with an i and those by transversions with a v. The position
of the deleted Alu in the query is indicated with an X in the
g5129s420 sequence. You can use the -inv option to produce alignments
in the orientation of the consensus sequence.
The lines in the .out file describing this match appear as:

578 28.4 2.9 5.0 g5129s420 7350 7467 (533) C MIR SINE/MIR (1) 261 149
2222 10.2 2.7 0.0 g5129s420 7468 7762 (238) C AluSg SINE/Alu (7) 303 1
578 28.4 2.9 5.0 g5129s420 7763 7882 (118) C MIR SINE/MIR (113) 149 28
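
The i/v marks and the summary lines under the alignment above can be
checked with a few lines of Python. The sketch below classifies a
mismatch as a transition (purine to purine or pyrimidine to pyrimidine)
or a transversion, and then recomputes the printed ratios from the
counts given in the alignment. The function name is made up for this
example, and the counts are simply copied from the summary lines, not
recounted from the alignment.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def mismatch_code(query_base, consensus_base):
    """Return 'i' for a transition, 'v' for a transversion, and None for
    a match or for any column that is not a pair of A/C/G/T bases
    (gaps, X, N)."""
    q, c = query_base.upper(), consensus_base.upper()
    if q == c or q not in "ACGT" or c not in "ACGT":
        return None
    pair = {q, c}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "i"
    return "v"


# Third column of the first alignment block above: C in the query, A in MIR.
print(mismatch_code("C", "A"))

# Counts copied from the two summary lines under the alignment.
transitions, transversions = 45, 23
gap_inits, gap_denominator, gap_bases = 8, 234, 19
print("Transitions / transversions = %.2f" % (transitions / transversions))
print("Gap_init rate = %.2f" % (gap_inits / gap_denominator))
print("avg. gap size = %.2f" % (gap_bases / gap_inits))
```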



Discrepancies between alignments and the .out file

Discrepancies between alignments and annotation result from the
adjustments made by the ProcessRepeats script to produce more legible
annotation. This annotation also tends to be closer to the biological
reality than the raw cross_match output.

For example, adjustments often are necessary when a repeat is
fragmented through deletions, insertions, or an inversion. Many
subfamilies of repeats closely resemble each other, and when a repeat
is fragmented these fragments can be assigned different subfamily
names in the raw output. ProcessRepeats often can decide if fragments
are derived from the same integrated transposable element and which
subfamily name is appropriate (subsequently given to all fragments).
This can result in discrepancies in the repeat name and matching
positions in the consensus sequence (subfamily consensus sequences
differ in length).

In many cases matches are fused into one annotation. To give just four
common examples: (1) A-rich simple repeats originated from the poly A
tail of ALUs and LINEs are incorporated in the annotation of the Alu
or LINE1. (2) In large sequences that RepeatMasker analyses in
fragments, consecutive fragments overlap, and repeats in these
overlaps will appear twice (partially or whole) in the alignment file.
(3) There is an 'endless' number of subfamilies for retroposons which
cannot all be represented in the databases and sometimes an element
is matched by overlapping pieces of two related subfamilies (which
will be merged). (4) You may find large discrepancies in position
numbering if an element includes tandem repeat units. For example,
MER109 contains multiple ~300 bp repeat units which can lead to
overlapping matches. In the annotation such matches are fused.

Specific LINE1 problems:

Some other discrepancies are specific to LINE elements. These repeats
do not appear as complete elements in the consensus database. This is
mostly due to the contrast in conservation over the length of its
sequence during its evolution in the mammalian genome; the ~3 kb ORF2
region of LINE1 has been very conserved, whereas the untranslated
regions and ORF1 to a lesser degree have evolved very fast. Thus the
3' end or 5' end of an ancient LINE1 does not even remotely resemble
that of the currently active LINE1, whereas the coding region for
reverse transcriptase is closely related. Thus, many subfamilies have
been defined for both the 5' and 3' UTRs (30 and 52, resp.) of LINE1
elements in human DNA, whereas only four ORF2 entries are present in
the database. Besides some remaining uncertainties about which 5' ends
go with which 3' ends, including 50 full length (6.2-8 kb) LINE1
elements in the database would make the program very slow. LINE1
elements therefore are presented in the database in 3 pieces, and the
ProcessRepeats script tries to put these pieces together as well as
possible. As a result both the names of the repeats and position
numbering in the consensus sequence are generally different in the
alignments than in the output file. The currently 3.3 kb LINE2
elements are likewise broken up in the databases, in 3' UTRs for
different subfamilies and one (complete!) ORF2 region.

Between LINE1 subfamilies, the 3' UTR ranges from 500 bp to over 2000
bp (in L1MC/D3), and the length of the 5' UTR is even more variable,
even between subfamilies that show strong similarity in the 3' UTR.
To allow the LINE1 fragments to be put together, all position numbers
in older LINE1 subfamilies are normalized relative to the position of
ORF2 (the conserved part of LINE1) in a complete L1PA2 element. Since
some older elements have much longer 5' UTRs or ORF1-ORF2 linker
regions than L1PA2, this often results in the assignment of negative
position numbers for the 5' end of LINEs. Since the March2000 release,
such positions and all positions in fragments thought to be part of
the same LINE1 insert are readjusted to count from the 5' end (which
is not necessarily the very 5' end of the LINE1 source gene, as these
are hard to derive for old elements). One problem with this approach
is that positions are not adjusted in detached 3' fragments that are
accidentally not recognized by the program as originating from the
same insertion, so that the connection between the 5' fragments and 3'
fragments may become obscured. Use the option '-neg' of ProcessRepeats
to retrieve an output in which all LINE1s are numbered with respect to
the ORF2 position.




The summary (.tbl) file

The summary file is pretty much self explanatory. Below is an example.

==================================================
file name: NT_001227.seq
sequences: 1
total length: 407653 bp
GC level: 38.40 %
bases masked: 356530 bp ( 87.46 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
SINEs: 51 13124 bp 3.22 %
ALUs 40 11536 bp 2.83 %
MIRs 11 1588 bp 0.39 %

LINEs: 135 288734 bp 70.83 %
LINE1 122 283327 bp 69.50 %
LINE2 11 3981 bp 0.98 %

LTR elements: 40 46999 bp 11.53 %
MaLRs 22 22834 bp 5.60 %
ERVL 9 9479 bp 2.33 %
ERV_classI 6 12148 bp 2.98 %
ERV_classII 2 1966 bp 0.48 %

DNA elements: 9 5226 bp 1.28 %
MER1_type 4 679 bp 0.17 %
MER2_type 4 3270 bp 0.80 %
Mariners 1 1277 bp 0.31 %

Unclassified: 0 0 bp 0.00 %

Total interspersed repeats: 354083 bp 86.86 %


Small RNA: 1 90 bp 0.02 %

Satellites: 0 0 bp 0.00 %
Simple repeats: 34 1587 bp 0.39 %
Low complexity: 19 769 bp 0.19 %
==================================================

* most repeats fragmented by insertions or deletions
have been counted as one element

The sequence(s) were assumed to be of primate origin.
RepeatMasker version 03/03/2000 default
ProcessRepeats version 03/03/2000
Repbase version 02/29/2000


The classification in this table is well defined (see my reviews in
Curr Opin Genet Devel, listed under Literature below) and forms a good
basis for visual presentation and tabulation of the repeats in your
study.

We've been able to classify almost all human repeats, most of them
even in subclasses. Because not all elements fit in a subclass and a
few minor subclasses are not listed separately in this table
(e.g. LINE3, the T2/TTAA and several hAT-like families of DNA
transposons), the totals for the classes often are higher than the sum
of the sub classes. The HAL1 element, potentially ancestral to LINE1,
is added to the LINE1 total in this table.

Note that the "MER" subclasses have no relationship to each other. The
term MER (MEdium Reiterated repeats) was introduced for purely
administrative purposes to give the beast a name. The MER1, MER2, and
MER4 groups were named after the first member of these groups
identified as an interspersed repeat in our genome.

The nomenclature of mammalian repeats derived from retrovirus-like
elements is different from older versions. I've now divided this class
up in the traditional class I, class II (ERVK), class III (ERVL)
retroviruses and the ERVL-derived but very distinct non-autonomous
MaLR elements. Since 'class III' is not an accepted classification
yet, for now this class is called ERVL. The large MER4-group of
non-autonomous LTR elements merges seamlessly with class I endogenous
retroviruses, making it hard to define, and is now incorporated in the
latter group. The ERV classes are most readily distinguished by the
size of the insertion site duplication: 4 in class I, 6 in class II, 5
in class III. However, my LTR classification is based on internal
sequences and matches to LTRs with internal sequences, not on the size
of the target site duplication.


The ProcessRepeats script tries very hard to find out which repeat
fragments were derived from the same insertion event of a transposable
element, but there still may be a slight overestimate. For example,
tandem duplications of (a fragment of) a repeat are counted as
separate insertions. Although this (relatively rare) phenomenon is
easily observed by eye, I haven't thought of a reliable method to
detect these in the script. However, a high number of tandem
repetitions can distort the estimated number of insertions
considerably.


The 'bases masked' number is calculated from the total number of Xs in
the masked sequences (before these are changed to Ns or lower case
letters). The other numbers are derived from the annotation (.out)
file. Discrepancies between the 'bases masked' number and the sum of
'total interspersed repeats', small RNA, satellites and low complexity
are generally very small. Most of these are accounted for by unmasked
regions between flanking identical simple repeats, annotated as one
stretch if fewer than 10 bases separate them, and fragments of repeats
shorter than 10 bp which are not annotated but are masked. The numbers
may be quite different if you started out with a query sequence
containing Xs.
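
The relation between these numbers can be checked against the example
table above. The short Python sketch below only redoes the arithmetic
with values copied from that table; note that the simple repeats have
to be included in the sum to come within one base of the 'bases masked'
number.

```python
# Values copied from the example summary table above.
total_length = 407653
bases_masked = 356530

interspersed_classes = {
    "SINEs": 13124,
    "LINEs": 288734,
    "LTR elements": 46999,
    "DNA elements": 5226,
    "Unclassified": 0,
}
other_classes = {
    "Small RNA": 90,
    "Satellites": 0,
    "Simple repeats": 1587,
    "Low complexity": 769,
}

total_interspersed = sum(interspersed_classes.values())
total_annotated = total_interspersed + sum(other_classes.values())

print("total interspersed repeats:", total_interspersed)
print("total annotated:", total_annotated)
print("masked but not annotated:", bases_masked - total_annotated)
print("bases masked: %.2f %%" % (100 * bases_masked / total_length))

# Class totals can exceed the sum of the listed subclasses, e.g. for LINEs.
line_subclasses = {"LINE1": 283327, "LINE2": 3981}
print("LINEs not in a listed subclass:",
      interspersed_classes["LINEs"] - sum(line_subclasses.values()))
```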



Repeat databases

The interspersed repeat databases screened by RepeatMasker are
maintained in parallel with the repeat databases (Repbase Update)
copyrighted by the Genetic Information Research Institute (G.I.R.I.).
The Repbase Update database contains annotation of many repeats with
respect to divergence level, affiliation, copy number, etc. Much if
not most of the information in this database is not published
elsewhere. It can be accessed at
http://www.girinst.org/~server/repbase.html. We are trying to keep the
nomenclature of the interspersed repeats in the output of RepeatMasker
identical to that of the reference database. In most cases the names
correspond to those most commonly used in the literature.



Reference

We haven't published a paper on RepeatMasker yet, but would appreciate
it if you could refer to the web page (Smit,AFA & Green,P RepeatMasker
at http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) or
otherwise to Smit, AFA & Green, P., unpublished results.



Literature
This list is minimal and restricted to interspersed repeat research in
human DNA.


Overviews

Smit, A.F.A. (1999) Interspersed repeats and other mementos of
transposable elements in mammalian genomes. Curr Opin Genet Devel 9
(6), 657-663.

Jurka, J. (1998) Repeats in genomic DNA: mining and meaning. Curr Opin
Struct Biol 8 (3), 333-337

Smit, A.F.A. (1996) Origin of interspersed repeats in the human
genome. Curr Opin Genet Devel 6 (6), 743-749.

Smit, A.F.A. (1995) Origin and evolution of mammalian interspersed
repeats. PhD dissertation, USC.


SINE/Alu

Schmid, C.W. (1998) Does SINE evolution preclude Alu function? Nucleic
Acids Res 26, 4541-4550.

Schmid, C.W. (1996). Alu: structure, origin, evolution, significance,
and function of one-tenth of human DNA. Prog Nucleic Acids Res Mol
Biol 53, 283-319.

Jurka, J. (1996) Origin and evolution of Alu repetitive elements. In
"The impact of short interspersed elements (SINEs) on the host
genome", Maraia, R.J., editor. Springer Verlag.


SINE/MIR & LINE/L2

Smit, AFA, and Riggs, AD. (1995). MIRs are classic, tRNA-derived SINEs
that amplified before the mammalian radiation. Nucleic Acids Res 23,
98-102.


LINE/L1

Smit, AFA, Toth, G, Riggs, AD, Jurka, J. (1995). Ancestral mammalian-wide
subfamilies of LINE-1 repetitive sequences. J Mol Biol 246, 401-417.


LTR/MaLR

Smit, A. F. A. (1993). Identification of a new, abundant superfamily
of mammalian LTR-transposons. Nucleic Acids Res 21, 1863-72.


LTR/Retroviral

Wilkinson, D. A., Mager, D. L., and Leong, J. C. (1994). Endogenous
Human Retroviruses. In The Retroviridae, J. A. Levy, ed. (New York:
Plenum Press), pp. 465-535.


DNA/all types

Smit, A.F.A. and Riggs, A. D. (1996). Tiggers and other DNA
transposon fossils in the human genome. Proc Natl Acad Sci USA 93,
1443-8.





Update history:

Improvements and new features in the April 1997 version compared to
the June 1996 version:

Besides a massive (2.5 fold) expansion of the databases, the program
itself is more sensitive and selective, has several new features and
an improved output. The script is now divided in two; one
(RepeatMasker) performs the cross_match searches, the other
(ProcessRepeats) takes the RepeatMasker output to create the overview
table and to improve the output in the .out file. The cross_match
searches have been optimized, especially with regard to detection of
low complexity sequences and old LINE1 elements. The most obvious
changes in the processed output file compared to the unprocessed file
are (i) overlapping matches are usually resolved, (ii) LINE1 fragments
are annotated with position numbers as in a full L1 element, and (iii)
when an Alu or LINE1 is fragmented information from both or all
fragments is used to assign a subfamily name. New features in the
program include the ability to screen a custom library and to create
an output file with alignments in positional order.

Improvements May 97: (minor update)
- added option to only mask low complexity DNA
- added version information to .tbl output
- changed artreps.lib to othermamreps.lib, adjusted parameters to
accommodate larger size of db
- many improvements in estimating number of elements in query
- added name adjustments for MLT2
- fixed many bugs...

Improvements September 1997 (minor update)
- major expansion of the rodent libraries and significant update
of the human libraries as well, especially in LINE1 elements.
- scripts modified to accommodate new entries in databases
- simple repeats masking optimized by including pentamers and
using a more stringent matrix
- several bugs fixed (e.g. sequences without repeats are now counted)
- table now displays parameters used
- temporarily, for comparison with the human LINE library the same
minimum match is used in the selective settings as in the default
settings to avoid masking small inserts in the LINE elements
- forthcoming release of cross_match has improved performance on a
tandemly repeated element (currently sometimes the lower scoring
unit may go unmasked, even when it is a common repeat)

Improvements and new features in the May 1998 version compared to
the September 1997 version:

- the program now accepts most 'not quite fasta' format files
- large sequences are analyzed in fragments of 100 kb to reduce the
memory requirements of the program. Similarly files with very
many sequence entries are divided up. You shouldn't notice any
of this in the output files.
- matrices are used that are optimal for the divergence level of the
repeats to which the query is compared and the background
nucleotide composition.
- another big update of the human repeat databases.
- the small RNA sequences have been corrected and expanded (all tRNAs
should be there now)
- close to perfect simple repeats, full-length shorter interspersed
repeats and young LINE1 3' ends are excised from the sequence
(in both human and rodent analysis) to allow better detection
of any underlying repeats. A sequence file with these repeats
deleted can be saved.
- the -low option doesn't mask out any type of simple repeats anymore
- alignments are shown in the orientation of the query sequence
- new options include
masking Alus only
obtaining a sequence with full lengths repeats deleted
obtaining a(n incomplete) list of possibly polymorphic microsatellites
setting a cutoff score when using the -lib option.
minor fixes
- the .out.xm and .ace files now also contain the simple repeats and
low complexity DNA (can still be omitted by running
ProcessRepeats with the -low option on the .cat file)
- sequence names including a number between parentheses used to
confuse the program thoroughly; now fixed
- many that you wouldn't find interesting



Improvements and new features December 1998

- This version is optimized for use with the 1998 cross_match release
The difference for RepeatMasker is mainly in the complexity adjusted
length of the matches that function as kernels for Smith Waterman
alignments and the matrix dependent adjustment of the score for
complexity of the alignment.
- Among bugs in the May 1998 version fixed are those resulting in
bogus output when the sequence name ends with .seq and when a raw
sequence is submitted. Also, sequence files that contain carriage
returns from PCs and Mac are handled better now.
- You can now limit the masking to younger repeats by setting a
maximum allowed divergence of repeats from their consensus sequence
- A mRNA/EST option is available that prevents false masking due to
inappropriate matrix choice and low complexity matches to LINE1 elements.
- You can set the background GC level (determining which matrices are
used) overriding the program's calculations.
- The full description ('>') lines are retained in the masked file.
- The .out file table can be returned with flexible length columns
allowing the full length of long query sequence names to be displayed
- The sequences identified as repeats can be returned in lower case
(rest in capitals) rather than masked out by Ns or Xs.
- Output to the screen is more informative and less panicky
- Simple repeat and satellite masking has been improved again; their
annotation has changed a bit, most notably they are now all listed in
the orientation of the query sequence


April 1999

The default return format of the annotation file is changed, hopefully
in a way that does not interfere with any type of parsing; the width
of the columns is now adjusted to the longest entry in that column,
allowing query names to be spelled out in full, and usually leading to
narrower tables.

Arabidopsis, Drosophila, and grass repeat libraries were added; other
repeat libraries were updated.

Three measures were taken to eliminate the (few) false positives:
- Use of the actual average GC level of sequences in a batch file may
sometimes lead to false masking (or failure to mask) in sequences that
diverge largely from the average. Thus, by default, all batch files
are now analyzed with the innocuous 43% matrices.
- one entry, responsible for 90% of false masking in GC rich regions,
is deleted from the 'tough L1' library.
- the matrix used for identification of the most diverged sequences in
very GC rich regions, based on too little data and too much
extrapolation, was 'too easy' on the mismatches and has been
adjusted.
Thanks to these measures the 'mrna' option is not necessary and has
been removed.

A bug is fixed that led to (wildly) improper annotation for some
sequences fully consisting of repeats (all bases masked). A series of
lesser bugs were taken care of. New bugs were skillfully introduced,
probably.

May 5 1999
- Eliminated a really dumb bug that resulted in having the percent
deletions replaced by the percent insertions.
- Made it easier to use your own database with repeatmasker. The
database does not have to be in the repeatmasker directory.


March 2000

Besides a long overdue update of the databases the following
improvements have been made:

speed, sensitivity, user-friendliness
- It is now possible to run large sequences and batch files on
multiple processors.
- An even faster option (-qq) is available for people in a serious hurry
- More repeats are cut out, in particular LINE1 3' fragments, to better
uncover underlying repeats
- I've reduced the default fragment length to 51000 bp (incl 1000 bp
overlaps); this gives a slightly lower chance of running out of memory
(followed by resorting to a larger wordlength) and sometimes better
choice of substitution matrices.
- The -cut option does not overrule fragmentation anymore
- RepeatMasker now handles zipped (.gz) and compressed (.Z) sequence files
- You can now quit the program at any point with 'control-c'.


annotation, display, summary
- An option is added providing unique IDs for individually integrated
elements, labeling fragments of the same element with the same number
- Classification of mammalian LTR elements has changed (now includes
the conventional three ERV classes)
- Some repeat names have been adjusted (notably the MLT2 subfamilies) to
be consistent with the RepBase nomenclature
- Improved interpretation of fragmented sequences resulting in more
accurate counts (for the .tbl file) of total insertions in the query
sequence
- Negative coordinates in LINE1 elements are now avoided (but see
'Specific LINE1 problems' in helpfile above)
- Improved accounting of LTR elements; now most LTR elements receive
the same name for the LTRs and internal sequence and are counted as
one insertion.
- Divergence and insertion/deletion levels are calculated for
annotations that are derived from two or more fused fragments
- Fixed the .ace output so that the orientation of the match is displayed.
- Output can be retrieved in the GFF (General Feature Format)
format. The current output is following a Sanger convention.

bugs
- The .tbl file format was not prepared for sequences over 10 million bp.
It's now ready for sequences up to 1 billion bp. For larger sequences,
I'd recommend doing the analysis in two or more steps...
- A bug has been fixed that crashed scripts trying to start several
RepeatMasker jobs simultaneously
- A bug is fixed that resulted in sometimes incorrect output, when
multiple files were fed to repeatmasker and one was masked in full
- Sequences and fragments >> 100 kb consisting entirely of Ns (no ACGT)
used to crash the program
- Drosophila and Arabidopsis masking allowed no overlaps in matches..
- Several other bugs were fixed that gave slightly incorrect output
under cruel and unusual circumstances






If you have ideas for improvements or found a bug, drop a note
at asmit@nootka.mbt.washington.edu or asmit@systemsbiology.org

/*****************************************************************************
# Copyright (C) 1996-2000 by Arian Smit
# All rights reserved.
#
# The software and databases should not be redistributed or used for
# any commercial purpose, including commercially funded sequencing,
# without written permission from the author and the University of
# Washington.
/*****************************************************************************