Statistics
Let N(P) denote the number of occurrences of a pattern P in a
given sequence. If we assume the sequence is random (according to
a model of our choice), N(P) becomes a random variable and we can
associate p-values with observations using the following statistic:
S = - log10[ proba( N(P) > Nobs(P) ) ] when P is seen more than expected
and
S = +log10[ proba( N(P) < Nobs(P) ) ] when P is seen less than expected
For example:
- S=+3.23 means the pattern is over-represented (seen more than
expected) with a p-value of 10^-3.23 = 5.888e-4.
- S=-12.67 means the pattern is under-represented (seen less than
expected) with a p-value of 10^-12.67 = 2.138e-13.
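As a minimal sketch of the conversion between a p-value and the signed
score S defined above (the function names are illustrative and are not
part of SPatt):

```python
import math

def score_from_pvalue(p_value, over_represented):
    """Signed score S: -log10(p) if over-represented, +log10(p) otherwise."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    log_p = math.log10(p_value)
    return -log_p if over_represented else log_p

def pvalue_from_score(score):
    """Recover the p-value from S: p = 10 ** (-|S|)."""
    return 10.0 ** (-abs(score))

# The two examples given above.
print(f"{pvalue_from_score(3.23):.3e}")    # over-represented pattern
print(f"{pvalue_from_score(-12.67):.3e}")  # under-represented pattern
```

The sign of S carries the direction (over- or under-representation) and
its magnitude carries the significance.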
Several tools are provided, based on different statistical methods:
- S-SPatt (Simple Statistics for Patterns) computes p-values using a binomial
approximation. This approximation is known to be theoretically incorrect, but
in practice it is a very fast and reliable heuristic (check the benchmarks section).
- G-SPatt (Gaussian Statistics for Patterns) computes the expectation and
variance of pattern counts and derives a p-value approximation from them.
- LD-SPatt (Large Deviations Statistics for Patterns) is based on large
deviations theory. The computed p-values are especially reliable when they are
very small, but they are asymptotic and so must be used with care on short
sequences (say, less than 10000 long). For such sequences, exact approaches
should be preferred. (LD-SPatt is now included in SPatt, but the old separate
package is still available in our download section.)
- X-SPatt (eXact Statistics for Patterns) uses exact computations to give
high-quality p-values. Memory requirements grow linearly with the number of
occurrences, and time complexity is proportional to both the sequence length
and the number of occurrences of the pattern. This method can therefore take a
long time on long sequences (say, more than 10000 long).
- CP-SPatt (Compound Poisson Statistics for Patterns) uses the Chen-Stein method
to approximate N(P) with a geometric Poisson distribution. Thanks to a convenient
recurrence, p-value computations are linear (and not quadratic) in the observed
number of occurrences. For non-overlapping patterns, these approximations fall
back to the simple Poisson approximation, which is very close to the binomial
statistics implemented in S-SPatt.
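The last remark can be illustrated with a small sketch. It is not SPatt's
implementation: the function names, the parameters lam (expected number of
clumps) and a (geometric clump-size parameter), the example numbers, and the
use of the inclusive tail proba( N(P) >= Nobs(P) ) are all assumptions made
for illustration. How lam and a would be derived from a pattern and a model
is not covered here. The geometric Poisson point probabilities are obtained
with a two-term recurrence that is linear in the number of occurrences; with
a = 0 (no overlap) it reduces to the plain Poisson distribution, whose tail
can be compared with the binomial one.

```python
import math

def binomial_pmf(n_trials, p, k):
    """P(X = k) for X ~ Binomial(n_trials, p) with 0 < p < 1 (log space)."""
    if k < 0 or k > n_trials:
        return 0.0
    log_pmf = (math.lgamma(n_trials + 1) - math.lgamma(k + 1)
               - math.lgamma(n_trials - k + 1)
               + k * math.log(p) + (n_trials - k) * math.log1p(-p))
    return math.exp(log_pmf)

def geometric_poisson_pmf(lam, a, n_max):
    """Return [P(N = 0), ..., P(N = n_max)] for a geometric Poisson law.

    N is the sum of K ~ Poisson(lam) clump sizes, each clump size being
    geometric: P(size = k) = (1 - a) * a ** (k - 1) for k >= 1.
    Each term needs only the two previous ones, so the cost is linear.
    """
    u = lam * (1.0 - a)
    pmf = [math.exp(-lam)]  # underflows for very large lam (sketch only)
    for n in range(1, n_max + 1):
        prev1 = pmf[n - 1]
        prev2 = pmf[n - 2] if n >= 2 else 0.0
        pmf.append(((2 * (n - 1) * a + u) * prev1
                    - (n - 2) * a * a * prev2) / n)
    return pmf

def upper_tail(pmf, n_obs):
    """P(N >= n_obs), given the point probabilities pmf[0], pmf[1], ..."""
    return max(0.0, 1.0 - sum(pmf[:n_obs]))

# Hypothetical example: 10000 positions, occurrence probability 1e-3
# per position, 20 observed occurrences (10 expected).
n_positions, p_occ, n_obs = 10_000, 1e-3, 20
mean = n_positions * p_occ

cases = [
    ("binomial", [binomial_pmf(n_positions, p_occ, k) for k in range(n_obs)]),
    ("Poisson (a = 0)", geometric_poisson_pmf(mean, 0.0, n_obs - 1)),
    # Same expected count, but occurrences arrive in clumps (a = 0.25).
    ("geometric Poisson (a = 0.25)",
     geometric_poisson_pmf((1 - 0.25) * mean, 0.25, n_obs - 1)),
]
for name, pmf in cases:
    p_value = upper_tail(pmf, n_obs)
    print(f"{name}: p-value = {p_value:.3e}, S = {-math.log10(p_value):+.2f}")
```

Computing the tail as one minus a cumulated sum loses precision for very
small p-values, so this sketch is only meaningful for moderate ones.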
Please check our reference page for more details.