Manpage of SPatt

Table of contents:

Help summary
Detailed descriptions

Help summary

All SPatt programs have a similar syntax:

*-SPatt command line options

Statistics for Patterns

Usage: [s|g|cp|ld|x]spatt [-hVqvdb] [--debug-level=] [-a descriptor] [FASTA sequence]
	[-M ] [-S ] [-U ] [-C ] [-P ]
	[-l ] [-p descriptor]... [-m ] [--all-words] [--normalize] [--max-pvalue]
	[--nobs ]

  -h, --help                     displays this message
  -v, --version                  displays version number
  -q, --quiet                    quiet output (same effect as --debug-level -1)
  -V, --Verbose                  verbose output (same effect as --debug-level 1)
  -d, --debug                    debug output (same effect as --debug-level 2)
  --debug-level=                 set the verbose level:
                                 -1 (quiet)
                                 0 (normal)
                                 1 (verbose)
                                 2 and more (debug)
  -a, --alphabet=descriptor      define the alphabet to use
                                 ex: -a "acgt:tgca" for standard DNA alphabet
  -M, --Markov-file=             destination file for the Markov model parameters
  -S, --Stationary-file=         destination file for the stationary distribution
  -U, --Use-markov-file=         file containing the Markov model parameters
  -C, --Count-file=              file containing word counts
  -P, --Pattern-file=            file containing patterns
  -l, --length=                  length of the longest word counted
  -p, --pattern=descriptor       define a pattern (could be used several times)
  -b, --both-strands             count patterns on both strands
                                 (useless if no complementary alphabet defined)
  -m, --markov=                  order of the Markov model
  --all-words                    process all words of the given length
  -n, --normalize                statistics are normalized by the length of the sequence
  --nobs                         set the number of observations from this option rather
                                 than from the sequence

More details

Alphabet descriptor
Pattern descriptor
FASTA Sequence
Both strands
Markov model order
Markov file
Stationary file
Pattern file
Count file
Normalize
pvalue filtering
custom number of observations

Alphabet descriptor

This option specifies the alphabet used when reading sequences and patterns. First, the letters of the alphabet are given as an ordered sequence; each letter is either a single character or a list of characters between brackets. Then, optionally, a complementary alphabet can be specified after a ":". Only one character per letter should be used in the complementary alphabet.

By default, white space (blanks, tabs, carriage returns, ...) is ignored, and alphabets are treated as case-insensitive unless two different cases are used in the alphabet descriptor.

Each time an invalid character is found in a sequence, it is treated as an interruption, which has the same effect as splitting the sequence into two separate pieces at that point.
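
To make the descriptor syntax concrete, here is a minimal Python sketch of a parser. It is not SPatt's own code: the names parse_alphabet and complement_map are hypothetical, the bracket characters are assumed to be "[" and "]", and the complement is assumed to be matched to the alphabet position by position. Case handling is left out.

```python
def parse_alphabet(descriptor):
    """Parse a descriptor such as "acgt:tgca" or "[ag][ct]".

    Returns (letters, complement): letters is a list of strings, one per
    letter, holding every character accepted for that letter; complement
    is a list of single characters, or None when no ":" part is given.
    """
    # White space is ignored, as stated in the text above.
    descriptor = "".join(descriptor.split())
    main, sep, comp = descriptor.partition(":")

    letters = []
    i = 0
    while i < len(main):
        if main[i] == "[":
            # Assumption: a list of characters is written between "[" and "]".
            end = main.index("]", i)
            letters.append(main[i + 1:end])
            i = end + 1
        else:
            letters.append(main[i])
            i += 1

    complement = list(comp) if sep else None
    if complement is not None and len(complement) != len(letters):
        raise ValueError("complement must give one character per letter")
    return letters, complement


def complement_map(letters, complement):
    """Map the first character of each letter to its complement.

    Assumption: the i-th complement character pairs with the i-th letter.
    """
    return {letter[0]: c for letter, c in zip(letters, complement)}
```

With the descriptor "acgt:tgca" from the help summary, this sketch yields the letters a, c, g, t and the pairing a/t, c/g, g/c, t/a.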

Here are some examples of valid descriptors:

Pattern descriptor

A pattern descriptor should use only valid characters from the specified alphabet, plus "_", "[" and "]". These extra characters mark positions where the pattern may be degenerate: "." means any character, and brackets enclose a list of allowed characters. The character "|" can also be used as a separator between several patterns in the same descriptor.

Here are a few examples:
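
One way to picture a degenerate pattern is as the list of plain words it stands for. The Python sketch below expands a descriptor under stated assumptions: the function name expand_pattern is hypothetical, "." is taken as the wildcard (as in the sentence above), and the alphabet defaults to "acgt". How SPatt represents patterns internally is not described here.

```python
from itertools import product


def expand_pattern(descriptor, alphabet="acgt"):
    """Return every word matched by a pattern descriptor."""
    words = []
    # "|" separates several patterns within one descriptor.
    for pattern in descriptor.split("|"):
        positions = []  # allowed characters at each position
        i = 0
        while i < len(pattern):
            ch = pattern[i]
            if ch == ".":
                # Assumed wildcard: any character of the alphabet.
                positions.append(alphabet)
                i += 1
            elif ch == "[":
                # Bracketed list of allowed characters.
                end = pattern.index("]", i)
                positions.append(pattern[i + 1:end])
                i = end + 1
            else:
                positions.append(ch)
                i += 1
        words.extend("".join(w) for w in product(*positions))
    return words
```

Under these assumptions, "a.g" expands to aag, acg, agg, atg, and "a[ct]|gg" expands to ac, at, gg.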

FASTA sequence

All sequences must be in FASTA format, which is very simple: a text file containing one or more sequences, each starting with a title line whose first character is ">". No more than 100 columns should be used in the file (characters beyond that limit are simply ignored).

Here is a simple example:

> sequence 1
atcgtagctagc
atcgatcggtag
aa
> sequence 2
atatattagcta
atagatcgatcg
aatatatag
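
For readers who want to check a file before running SPatt, a minimal Python sketch of a FASTA reader follows. The function name read_fasta and the max_columns parameter are hypothetical, and truncating each line at 100 columns is only one reading of the limit stated above, not confirmed SPatt behaviour.

```python
def read_fasta(lines, max_columns=100):
    """Return a list of (title, sequence) pairs from FASTA lines."""
    records = []
    for line in lines:
        # Assumption: anything past max_columns on a line is dropped.
        line = line.rstrip("\n")[:max_columns]
        if line.startswith(">"):
            # A title line opens a new sequence.
            records.append((line[1:].strip(), []))
        elif records:
            # Sequence data; white space inside the line is removed.
            records[-1][1].append("".join(line.split()))
    return [(title, "".join(parts)) for title, parts in records]
```

Applied to the example above, the sketch returns two records, titled "sequence 1" (26 letters) and "sequence 2" (33 letters).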

Both strands

This option completes each pattern with its reverse complement, so that occurrences of the pattern on both strands are taken into account.
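
The reverse complement itself is easy to sketch in Python. The names reverse_complement, both_strands and DNA below are hypothetical, and the complement table assumes the "acgt:tgca" alphabet from the help summary. Whether SPatt keeps a self-complementary word once or twice is not stated in this page; the sketch keeps it once.

```python
# Complement table for the alphabet descriptor "acgt:tgca".
DNA = dict(zip("acgt", "tgca"))


def reverse_complement(word, complement=DNA):
    """Reverse the word and replace each letter by its complement."""
    return "".join(complement[ch] for ch in reversed(word))


def both_strands(words, complement=DNA):
    """Add the reverse complement of each word, without duplicates."""
    result = []
    for w in words:
        for candidate in (w, reverse_complement(w, complement)):
            if candidate not in result:
                result.append(candidate)
    return result
```

For instance, the reverse complement of aacg is cgtt, while acgt is its own reverse complement.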

Some examples:

Markov model order

This determines the order of the Markov model used to model the random sequence. By default, its parameters are estimated from the observed sequence by maximum likelihood. Alternatively, the parameters can be provided in a Markov file through the --Use-markov-file option.

Order 0 corresponds to the independent model, and order -1 to the independent and uniform model (no parameters).
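
As an illustration of maximum-likelihood estimation for such a model, the Python sketch below counts words of length m+1 and divides by the count of their length-m prefix. It is a sketch under assumptions, not SPatt's implementation: the name estimate_markov is hypothetical, contexts never seen in the sequence fall back to a uniform row, and sequence interruptions are not handled.

```python
from collections import Counter
from itertools import product


def estimate_markov(sequence, order, alphabet="acgt"):
    """Maximum-likelihood transition probabilities of an order-m model.

    Returns a dict mapping each context (a word of length `order`) to a
    list of probabilities for the next letter, in alphabet order.
    """
    k = len(alphabet)
    if order < 0:
        # Order -1: independent and uniform model, nothing to estimate.
        return {"": [1.0 / k] * k}

    # Count every word of length order + 1 in the sequence.
    counts = Counter(sequence[i:i + order + 1]
                     for i in range(len(sequence) - order))
    model = {}
    for context in map("".join, product(alphabet, repeat=order)):
        row = [counts[context + letter] for letter in alphabet]
        total = sum(row)
        # Assumption: an unseen context falls back to the uniform row.
        model[context] = [c / total if total else 1.0 / k for c in row]
    return model
```

With order 0 the single context is the empty word and the row holds the letter frequencies, which matches the independent model described above.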

Markov file

A Markov file contains the parameters of a Markov model. It is a simple text file in which lines starting with "#" are treated as comments. For an order m model on an alphabet of size k, the file must contain k^m non-comment lines, each containing k real numbers. Each line corresponds to one value of the last m letters (lines are in alphabetical order), and each column to the letter that follows.

Such a file is either produced as output by the program (--Markov-file option) or used as input (--Use-markov-file option).
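
The layout can be checked with a few lines of Python. The sketch below is only an illustration: the name read_markov_file is hypothetical, and it assumes that the k numbers on a line are separated by white space and that blank lines may be skipped, neither of which is stated above.

```python
def read_markov_file(lines, k):
    """Parse Markov-file lines into (order, rows) for an alphabet of size k."""
    if k < 2:
        raise ValueError("alphabet size must be at least 2")
    rows = []
    for line in lines:
        line = line.strip()
        # Skip comments (and, by assumption, blank lines).
        if not line or line.startswith("#"):
            continue
        values = [float(x) for x in line.split()]
        if len(values) != k:
            raise ValueError("expected %d values per line" % k)
        rows.append(values)

    # Recover the order m from the number of rows, which must equal k**m.
    order, size = 0, 1
    while size < len(rows):
        size *= k
        order += 1
    if size != len(rows):
        raise ValueError("row count is not a power of the alphabet size")
    return order, rows
```

A file with a single non-comment line is read as an order 0 model, and a file with k lines as an order 1 model.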

Stationary file

A stationary file contains the stationary distribution of a Markov model. It is a simple text file in which lines starting with "#" are treated as comments. For an order m model on an alphabet of size k, the file contains k^m lines of one real number each. Such a file can be produced with the --Stationary-file option.

Pattern file

A pattern file contains a list of pattern descriptors. It is a simple text file in which lines starting with "#" are treated as comments. Each non-comment line is read as a pattern descriptor.

Count file

A count file contains the number of occurrences of every word of a given length L. It is a simple text file in which lines starting with "#" are treated as comments. Each non-comment line gives one number of occurrences. All words of length L are listed this way, in alphabetical order.
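
To show how the line order maps to words, here is a minimal Python sketch that reads such a file. The name read_count_file is hypothetical, counts are assumed to be integers, and "alphabetical order" is assumed to mean lexicographic order with letters ranked as in the alphabet (here "acgt").

```python
from itertools import product


def read_count_file(lines, length, alphabet="acgt"):
    """Return a dict mapping each word of the given length to its count."""
    # Keep one integer per non-comment, non-blank line.
    counts = [int(line.strip()) for line in lines
              if line.strip() and not line.lstrip().startswith("#")]
    # Assumption: words are enumerated in lexicographic order.
    words = ["".join(w) for w in product(alphabet, repeat=length)]
    if len(counts) != len(words):
        raise ValueError("expected %d counts, found %d"
                         % (len(words), len(counts)))
    return dict(zip(words, counts))
```

For L = 2 on a four-letter alphabet the file must hold 16 counts, for aa, ac, ag, at, ca, and so on up to tt.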

Normalize

Statistics usually grow in magnitude with the length of the sequence considered; this is well known in large deviations theory and can be corrected simply by using the normalization provided by the --normalize option:

normalizedS = - 1/n log10[ proba( N(P) > Nobs(P) ) ] when P is seen more than expected
and
normalizedS = + 1/n log10[ proba( N(P) < Nobs(P) ) ] when P is seen less than expected
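
The two formulas can be written as one small Python function. This is a sketch only: the name normalized_statistic and its parameters are hypothetical, n is assumed to be the length of the sequence (as the help summary suggests), and the tail probability is taken as given rather than computed.

```python
import math


def normalized_statistic(tail_probability, over_represented, n):
    """Normalized score for a pattern P.

    tail_probability : proba(N(P) > Nobs(P)) if P is seen more than
                       expected, proba(N(P) < Nobs(P)) otherwise
    over_represented : True when P is seen more than expected
    n                : sequence length (assumed meaning of n above)
    """
    sign = -1.0 if over_represented else 1.0
    return sign * math.log10(tail_probability) / n
```

Since the tail probability is below 1, its logarithm is negative, so the score is positive for an over-represented pattern and negative for an under-represented one.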

pvalue filtering

With this option (--max-pvalue), only patterns with a p-value lower than the limit given as argument are output by the program. Note that all statistics are computed anyway; this is only an output filter.

custom number of observations

With this option (--nobs), every pattern considered is given the observed number of occurrences supplied as argument, instead of one computed from the sequence. Useful for testing and distribution purposes.