Help summary
Detailed descriptions
All SPatt programs have a similar syntax:
Statistics for Patterns
Usage: [s|g|cp|ld|x]spatt [-hVqvdb] [--debug-level=] [-a descriptor] [FASTA sequence] [-M ] [-S ] [-U ] [-C ] [-P ] [-l ] [-p descriptor]... [-m ] [--all-words] [--normalize] [--max-pvalue] [--nobs ]
  -h, --help                displays this message
  -v, --version             displays version number
  -q, --quiet               quiet output (same effect as --debug-level -1)
  -V, --Verbose             verbose output (same effect as --debug-level 1)
  -d, --debug               debug output (same effect as --debug-level 2)
  --debug-level=            set the verbose level: -1 (quiet) 0 (normal) 1 (verbose) 2 and more (debug)
  -a, --alphabet=descriptor define the alphabet to use
                            ex: -a "acgt:tgca" for standard DNA alphabet
  -M, --Markov-file=        destination file for the Markov model parameters
  -S, --Stationary-file=    destination file for the stationary distribution
  -U, --Use-markov-file=    file containing the Markov model parameters
  -C, --Count-file=         file containing word counts
  -P, --Pattern-file=       file containing patterns
  -l, --length=             length of the longest word counted
  -p, --pattern=descriptor  define a pattern (could be used several times)
  -b, --both-strands        count patterns on both strands (useless if no complementary alphabet defined)
  -m, --markov=             order of the Markov model
  --all-words               process all words of the given length
  -n, --normalize           statistics are normalized by the length of the sequence
  --nobs                    set number of observation from this option rather than using the sequence
Alphabet descriptor
Pattern descriptor
FASTA Sequence
Both strands
Markov model order
Markov file
Stationary file
Pattern file
Count file
Normalize
pvalue filtering
custom number of observations
The alphabet descriptor (-a, --alphabet) specifies the alphabet used to read sequences and patterns. First, the letters of the alphabet are given as an ordered sequence; each letter is either a single character or a list of characters between brackets. Then, optionally, a complementary alphabet can be specified after a ":". In that case, each letter must consist of a single character.
By default, white space (blanks, tabs, carriage returns, ...) is ignored, and alphabets are not case sensitive unless two different cases are used in the alphabet descriptor.
Each time an invalid character is found in a sequence, it is treated as an interruption and therefore has the same effect as splitting the sequence into two separate pieces.
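To make the descriptor syntax concrete, here is a minimal Python sketch of how such a descriptor could be parsed. It is not SPatt's own code: the function name parse_alphabet and the returned data layout are hypothetical, and case handling is left out for brevity.

```python
def parse_alphabet(descriptor):
    """Parse a descriptor such as "acgt:tgca" or "[ag][ct]".

    Returns (letters, complement): letters is a list of strings, one per
    alphabet letter, each holding the characters accepted for that letter;
    complement maps each character to its complement, or is None.
    Hypothetical helper, written only to illustrate the syntax.
    """
    # White space is ignored, as described above.
    descriptor = "".join(descriptor.split())
    main, sep, comp = descriptor.partition(":")

    letters = []
    i = 0
    while i < len(main):
        if main[i] == "[":
            end = main.index("]", i)  # raises ValueError if the bracket is unclosed
            letters.append(main[i + 1:end])
            i = end + 1
        else:
            letters.append(main[i])
            i += 1

    complement = None
    if sep:
        # A complementary alphabet requires one character per letter.
        if any(len(l) != 1 for l in letters) or len(comp) != len(letters):
            raise ValueError("complement needs one character per letter")
        complement = dict(zip(letters, comp))
    return letters, complement
```

With this sketch, "acgt:tgca" yields four single-character letters and the mapping a/t, c/g, while "[ag][ct]" yields two letters of two characters each and no complement.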
Here are some examples of valid descriptors:
"acgt:tgca" for the standard DNA alphabet (this is also the default alphabet)
"[ag][ct]" for the purine-pyrimidine alphabet
"abcdefghijklmnopkrstuv" for the standard Latin alphabet
"acgtN" for a custom case-sensitive DNA alphabet including the letter N
"[ARNDCE][QGHI][LKM][FPST][WYV]" for an amino-acid alphabet in five groups
A pattern descriptor should use only valid characters from the specified alphabet, plus ".", "[" and "]". These extra characters mark positions where the pattern is degenerate: "." means any character, and brackets enclose a list of allowed characters. The character "|" can also be used as a separator between several patterns in the same descriptor.
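The expansion rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name expand_pattern is hypothetical, the alphabet defaults to single-character DNA letters, and SPatt itself may not enumerate words this way internally.

```python
from itertools import product


def expand_pattern(descriptor, alphabet="acgt"):
    """List every word matched by a pattern descriptor (hypothetical helper)."""
    words = []
    for pattern in descriptor.split("|"):  # "|" separates several patterns
        choices = []                       # allowed characters at each position
        i = 0
        while i < len(pattern):
            c = pattern[i]
            if c == "[":
                end = pattern.index("]", i)
                choices.append(pattern[i + 1:end])
                i = end + 1
            elif c == ".":
                choices.append(alphabet)   # "." stands for any character
                i += 1
            else:
                choices.append(c)
                i += 1
        words.extend("".join(w) for w in product(*choices))
    return words
```

The number of words grows multiplicatively with each degenerate position, so a full enumeration like this one is only practical for short or mildly degenerate patterns.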
Here are a few examples:
"gctgg[tc]gg" expands to
{gctggtgg,gctggcgg}
"[at]atg.a" expands to
{aatgaa,aatgca,aatgga,aatgta,tatgaa,tatgca,tatgga,tatgta}
"agct|tgcat" expands to
{agct,tgcat}
All sequences must be in FASTA format, which is very simple: a text file containing one or more sequences, each starting with a title line whose first character is ">". No more than 100 columns should be used in the file (other characters will simply be ignored).
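As an illustration of the format, the following Python sketch reads FASTA text into titled sequences. It is an assumption-laden sketch rather than SPatt's reader: read_fasta is a hypothetical name, the alphabet is taken to be the default case-insensitive DNA alphabet, the 100-column limit is not enforced, and invalid characters are assumed to split a sequence into pieces as described for the alphabet.

```python
def read_fasta(text, alphabet="acgt"):
    """Return a list of (title, pieces) pairs from FASTA text.

    Each sequence is returned as a list of pieces, because an invalid
    character interrupts the sequence. Hypothetical helper.
    """
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A title line opens a new sequence.
            records.append((line[1:].strip(), [""]))
        elif records:
            pieces = records[-1][1]
            for c in line:
                if c.isspace():
                    continue              # white space is ignored
                c = c.lower()             # assumed case-insensitive alphabet
                if c in alphabet:
                    pieces[-1] += c
                elif pieces[-1]:
                    pieces.append("")     # invalid character: start a new piece
    return [(title, [p for p in pieces if p]) for title, pieces in records]
```

A sequence such as "acnngt" would come back as the two pieces "ac" and "gt", so no counted word ever spans the invalid characters.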
Here is a simple example:
> sequence 1
atcgtagctagc
atcgatcggtag
aa
> sequence 2
atatattagcta
atagatcgatcg
aatatatag
The --both-strands option (-b) completes the patterns with their reverse complements, so that occurrences of a pattern on both strands are taken into account.
Some examples:
-p gctggtgg -b gives the same result as
-p gctggtgg|ccaccagc
-p tc.a -b gives the same result as
-p tc.a|t.ga
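The equivalences above can be reproduced with a short Python sketch. The names reverse_complement and both_strands are hypothetical, the default complement table is the one implied by the "acgt:tgca" alphabet, and the sketch does not remove duplicates when a pattern is its own reverse complement.

```python
def reverse_complement(pattern, complement=None):
    """Reverse-complement a pattern that may contain "." and brackets."""
    if complement is None:
        # Complement table implied by the default alphabet "acgt:tgca".
        complement = {"a": "t", "c": "g", "g": "c", "t": "a"}
    # Brackets swap roles when the string is read backwards; "." is unchanged.
    special = {"[": "]", "]": "[", ".": "."}
    return "".join(special.get(c) or complement[c] for c in reversed(pattern))


def both_strands(descriptor):
    """Append the reverse complement of every pattern in the descriptor."""
    patterns = descriptor.split("|")
    return "|".join(patterns + [reverse_complement(p) for p in patterns])
```

Note that when such a descriptor is typed on a command line, the "|" character normally has to be quoted so that the shell does not interpret it as a pipe.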
The --markov option (-m) sets the order of the Markov model used to model the random sequence. By default, its parameters are estimated from the observed sequence by maximum likelihood. Alternatively, the parameters can be provided in a Markov file through the --Use-markov-file option.
Order 0 corresponds to the independent model, and order -1 to the independent and uniform model (no parameters).
A Markov file contains the parameters of a Markov model. It is a simple text file in which lines starting with "#" are treated as comments. For an order m model on an alphabet of size k, the file must contain k^m non-comment lines, each containing k real numbers. Each line corresponds to one value of the last m letters (in alphabet order), and each column to the following letter.
Such a file is either produced as output by the program (--Markov-file option) or used as input parameters (--Use-markov-file option).
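To illustrate the layout, here is a Python sketch that estimates the transition probabilities by maximum likelihood (simple transition counting) and formats them as k^m lines of k numbers. The names estimate_markov and format_markov_file are hypothetical, the number formatting and the uniform fallback for unseen contexts are assumptions, and order -1 is not handled.

```python
from collections import Counter
from itertools import product


def estimate_markov(sequence, order, alphabet="acgt"):
    """Return k**order rows of k transition probabilities each.

    Assumes the sequence is a single piece made of alphabet characters.
    """
    counts = Counter(sequence[i:i + order + 1]
                     for i in range(len(sequence) - order))
    rows = []
    for context in product(alphabet, repeat=order):  # contexts in alphabet order
        prefix = "".join(context)
        row = [counts[prefix + c] for c in alphabet]
        total = sum(row)
        # Unseen contexts fall back to uniform (an arbitrary choice in this sketch).
        rows.append([n / total if total else 1 / len(alphabet) for n in row])
    return rows


def format_markov_file(rows):
    """Format the rows as text: one comment line, then one line per context."""
    lines = ["# one line per context, one column per following letter"]
    lines += [" ".join(f"{p:.6f}" for p in row) for row in rows]
    return "\n".join(lines) + "\n"
```

For order 1 on the DNA alphabet this gives 4 lines of 4 numbers; for order 0 it gives a single line holding the letter frequencies.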
A stationary file contains the stationary distribution of a Markov model. It is a simple text file in which lines starting with "#" are treated as comments. The file contains k^m lines of one real number each for an order m model on an alphabet of size k. Such a file can be produced with the --Stationary-file option.
A pattern file contains a list of pattern descriptors. It is a simple text file in which lines starting with "#" are treated as comments. Each non-comment line is treated as a pattern descriptor.
A count file contains the number of occurrences of every word of a given length L. It is a simple text file in which lines starting with "#" are treated as comments. Each non-comment line gives one number of occurrences. All words of length L are listed this way, in alphabetical order.
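As a sketch of what such a file holds, the Python below counts every word of length L and writes one count per line. The names count_words and format_count_file are hypothetical; the sketch assumes that overlapping occurrences are counted and that "alphabetical order" means the order of the letters in the alphabet descriptor.

```python
from collections import Counter
from itertools import product


def count_words(sequence, length, alphabet="acgt"):
    """Return the counts of all words of the given length, in alphabet order."""
    seen = Counter(sequence[i:i + length]
                   for i in range(len(sequence) - length + 1))
    # Words that never occur still get a line, with a count of 0.
    return [seen["".join(w)] for w in product(alphabet, repeat=length)]


def format_count_file(counts, length):
    """Format the counts as text: one comment line, then one count per line."""
    lines = [f"# counts of all words of length {length}"]
    lines += [str(n) for n in counts]
    return "\n".join(lines) + "\n"
```

On a size k alphabet the resulting file has k^L count lines, so L = 2 on DNA gives 16 lines.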
Statistics usually grow in magnitude with the length of the sequence considered. This is well known in large deviations theory and can be corrected simply by using the normalization provided by the --normalize option, where n is the length of the sequence:
normalizedS = - 1/n log10[ proba( N(P) > Nobs(P) ) ] when P is seen more than expected
and
normalizedS = + 1/n log10[ proba( N(P) < Nobs(P) ) ] when P is seen less than expected
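The two formulas differ only in sign, which the following Python sketch makes explicit. The function name normalized_statistic and its arguments are hypothetical; the tail probability is assumed to have been computed elsewhere, and n is taken to be the sequence length.

```python
import math


def normalized_statistic(tail_probability, n, over_represented):
    """Apply the normalization formula to an already computed tail probability.

    The result is positive for over-represented patterns and negative for
    under-represented ones.
    """
    sign = 1.0 if over_represented else -1.0
    return -sign * math.log10(tail_probability) / n
```

For instance, a tail probability of 1e-6 on a sequence of length 1000 gives a normalized statistic of about 0.006 in magnitude, with the sign indicating over- or under-representation.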
With the --max-pvalue option, only patterns with a p-value lower than the limit given as argument are output by the program. Note that all statistics are computed anyway; this is only an output filtering option.
With the --nobs option, every pattern considered is given the observed number of occurrences supplied by this option (instead of the number computed from the sequence). This is useful for testing and distribution purposes.