SPatt (Statistics for Patterns) is a suite of C++ programs
designed to compute p-values for pattern occurrences in text.
Assuming the text is generated according to a Markov model,
the p-value of a given observation is the probability of seeing an
observation at least as extreme under that model. The lower the
p-value, the more unlikely the observation. For example, these tools
can be used to find patterns with unusual behaviour in DNA sequences.
Let N(P) denote the number of occurrences of a pattern P in a
given sequence, and Nobs(P) the count actually observed. If we consider
the sequence to be random (according to a model of our choice), N(P)
becomes a random variable and we can associate p-values with
observations using the following statistic:
S = - log10[ P( N(P) > Nobs(P) ) ] when P is seen more than expected
and
S = +log10[ P( N(P) < Nobs(P) ) ] when P is seen less than expected
For example, S=+3.23 means the pattern is over-represented (seen more
than expected) with a p-value of 10^-3.23 = 5.888e-4. S=-12.67 means
the pattern is under-represented (seen less than expected) with a
p-value of 10^-12.67 = 2.138e-13.
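
The sign convention above can be illustrated with a minimal C++ sketch.
The function names (spattStatistic, pValueFromStatistic) are hypothetical
and are not part of the SPatt programs; the code only restates the two
formulas and the worked examples.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: turn a tail probability into the signed statistic S.
// overRepresented == true  -> S = -log10(p)  (positive score)
// overRepresented == false -> S = +log10(p)  (negative score)
double spattStatistic(double pValue, bool overRepresented) {
    assert(pValue > 0.0 && pValue <= 1.0);
    const double logp = std::log10(pValue);
    return overRepresented ? -logp : logp;
}

// Hypothetical helper: recover the p-value from S. The sign of S only
// encodes the direction (over- or under-represented), so |S| is used.
double pValueFromStatistic(double s) {
    return std::pow(10.0, -std::fabs(s));
}
```

With these helpers, pValueFromStatistic(3.23) is about 5.888e-4 and
pValueFromStatistic(-12.67) is about 2.138e-13, matching the examples.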
Several tools are provided, based on different statistical methods:
- S-SPatt (Simple Statistics for Patterns) computes p-values using a
binomial approximation. This approximation is known to be theoretically
incorrect, but in practice it is a very fast and reliable heuristic.
- G-SPatt (Gaussian Statistics for Patterns) computes the expectation and
variance of pattern counts and derives a p-value approximation from them.
(not yet available!)
- LD-SPatt (Large Deviation Statistics for Patterns) is based on large
deviations theory. The computed p-values are especially reliable when
they are very small, but they are asymptotic and so must be used with
care on short sequences (say, fewer than 10000 characters long).
For such sequences, exact approaches should be preferred.
(not yet available in the SPatt package, but available as a separate
package!)
- X-SPatt (eXact Statistics for Patterns) uses exact computations with
arbitrary-precision methods to give high-quality p-values. Memory
requirements grow linearly with sequence length, and time complexity is
proportional to both the sequence length and the number of occurrences
of the pattern. Therefore, this method should be used only on short
sequences (say, fewer than 10000 characters long).
(not yet available!)
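
To give an idea of what a binomial approximation such as the one used by
S-SPatt involves, here is a minimal C++ sketch. It is not the S-SPatt
implementation. It assumes the count N(P) is treated as Binomial(n, p),
where n is the number of positions at which the pattern can start and p
is the probability that it occurs at a given position under the chosen
model; p is supplied by the caller. Dependence between overlapping
positions is ignored. All function names are hypothetical, and including
the observed count in the tail is a convention chosen for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// log(exp(a) + exp(b)), computed without leaving log space.
double logAddExp(double a, double b) {
    const double negInf = -std::numeric_limits<double>::infinity();
    if (a == negInf) return b;
    if (b == negInf) return a;
    return std::fmax(a, b) + std::log1p(std::exp(-std::fabs(a - b)));
}

// Natural log of the Binomial(n, p) probability mass at k.
double logBinomialPmf(long n, double p, long k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0)
         - std::lgamma(n - k + 1.0)
         + k * std::log(p) + (n - k) * std::log1p(-p);
}

// log10 of the sum of the probability mass over k = from..to (inclusive).
double log10BinomialRange(long n, double p, long from, long to) {
    assert(n >= 0 && p > 0.0 && p < 1.0 && 0 <= from && to <= n);
    double acc = -std::numeric_limits<double>::infinity();
    for (long k = from; k <= to; ++k) {
        acc = logAddExp(acc, logBinomialPmf(n, p, k));
    }
    return acc / std::log(10.0);
}

// Signed statistic S under the binomial assumption (hypothetical helper).
// Positive when the pattern is seen at least as often as expected.
double binomialStatistic(long sequenceLength, long patternLength,
                         double occurrenceProb, long observed) {
    const long n = sequenceLength - patternLength + 1;  // start positions
    const double expected = n * occurrenceProb;
    if (observed >= expected) {
        return -log10BinomialRange(n, occurrenceProb, observed, n);
    }
    return log10BinomialRange(n, occurrenceProb, 0, observed);
}
```

Working in log space keeps very small tail probabilities representable
in a double; the loop costs time linear in the number of summed terms.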