( all computation in this part have been done on a 2.8 GHz P4 CPU with 512 Mo RAM memory and Linux 2.4.24)
SPatt have been developed with a great concern about optimization. A simple way to illustrate this can be done comparing the word count function of both SPatt and the popular EMBOSS package.
The following table shows the total running time (in seconds) to compute the frequencies of all words of Ecoli K12 for several lengths:
| length | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMBOSS | na | 3.90 | 4.29 | 5.15 | 7.31 | 16.32 | 57.03 | 230.68 | 891.43 | na | na | na |
| SPatt | 0.09 | 0.18 | 0.21 | 0.28 | 0.30 | 0.35 | 0.41 | 0.49 | 0.95 | 1.46 | 1.98 | 2.68 |
frequencies of letter (length = 1) are not available in EMBOSS and frequencies for length greater than 10 were not computed (far too long).
As you can see, even for the simple task of reading the sequence, SPatt is more than 10 times faster than EMBOSS. It is also obvious than EMBOSS wordcount complexity is exponential with length of words while SPatt is linear. In fact, the memory allocation for stocking frequencies is always exponential with the length of considered words, but this allocation has poor effect on running time when done efficiently.
At last, let us point out the fact that using a user defined alphabet, SPatt can also deal with frequencies on texts in Latin or on proteins. For example, counting all words of length 5 on SWISSPROT can be done with SPatt in less than 7 seconds.
In this section we compare performance and reliability of the different programs available in the SPatt package:
SPatt-1.0.7 have been used for this benchmark.
| program | time | memory |
|---|---|---|
| X-SPatt (*) (**) | O(k^{2m})+O(n^2/k^h) | O(k^{2m})+O(n) |
| CP-SPatt (*) (**) | O(n/k^{h}) | O(k^{m+1})+O(h) |
| G-SPatt (*) (**) | O(k^{m+1})+O(hr^2) | O(k^{m+1}) |
| LD-SPatt | O(k^h) | O(k^h) |
| S-SPatt | O(k^{m+1})+O(hr) | O(k^{m+1}) |
These complexities are given for the computation of a pattern of size r (number of words in the pattern), of maximum length h (number of characters in each word) on a sequence of length n using an order m Markov model on a size k alphabet.
(*) indicates a program which does not yet provide statistics for patterns other than simple words (which means r is always equal to 1 for such programs).
(**) indicates a program which does not yet provides statistics computation when Markov model order is greater than m=0
The minimum memory requirement for such computations is O(k^{m+1}) for the Markov model parameters. G-SPatt and S-SPatt programs succeed to stick to that minimum complexity. Sequence length is involved in memory complexity in the case of X-SPatt computations as well as with CP-SPatt, but in this last case, it is in fact the observed number of occurrence which is used (approximated here by n/k^h). At last, in the case of LD-SPatt, memory grows exponentially with the length h of the pattern which is a severe limitation of the method.
In terms of time complexity, S-SPatt is clearly the fastest program but G-SPatt is quite similar. CP-SPatt is also fast when used on rare patterns but perform slower on frequent patters. LD-SPatt could be very slow with long patterns and the X-SPatt computation time is highly dependent on the sequence length.
As a conclusion, let say that both S-SPatt and G-SPatt seem suitable for all kind of parameters. CP-SPatt is slower but should remain usable for any kind of parameters (with some patience in the case of high number of pattern occurrences). LD-SPatt cannot be used for long pattern. X-SPatt is limited to short sequences.
We give here the empirical running time to compute the statistics of all words of length h using a standard DNA alphabet on a given sequence (HIV has a length of 9.7 Kb and Bacillus subtilis of 4.2 Mb).
| running time (s) | hiv h=4 | hiv h=7 | bsubtilis h=4 | bsubtilis h=7 |
|---|---|---|---|---|
| xspatt | 631.33 | 1434.55 | na | na |
| gspatt | 0.01 | 0.08 | 0.26 | 0.52 |
| cpspatt | 0.05 | 0.25 | 16.94 | 17.68 |
| ldspatt | 0.31 | 621.17 | 0.55 | 624.26 |
| sspatt | 0.00 | 0.07 | 0.24 | 0.48 |
All measurements have been done with on a P4 2.8 Gz processor with 512 Mo RAM running Linux 2.4.24.
As it was theoretically expected, sspatt and gspatt are the fastest programs. cpspatt works also very well even if running time grows with sequence length. ldspatt is not in the same time range than the other programs except for very short words.
We give here the comparison of the statistics of all words of length h using a standard DNA alphabet on two sequences (HIV has a length of 9.7 Kb and Bacillus subtilis of 4.2 Mb).
Rank accordance is measured on the tail distribution events using the classical Kendall's tau (which is equal to 1.0 in the case of perfect accordance and equal to -1.0 in the case of perfect discordance). Relative error is simple the ratio abs(test-ref)/ref.
For the moment, only the M0 case is available, but preliminary studies indicates that results should be very similar with higher order models.
(click on a plot to enlarge it in a new window)
|
|
|
|
(click on a plot to enlarge it in a new window)
As expected, G-SPatt lack of reliability for tail distribution events. As a consequence Kendall's tau is roughly around to 0.85 which the worse
Large deviations statistics have their characteristic shape where central distribution events are not very well handled while things gets better when events fall into the tail distribution. With a Kendall's at 0.98, LD-SPatt performs the best for over-represented words.
S-SPatt performs very well except for overlapping words (such as the two isolated points which are, from left to right, gggg and tttt).
CP-SPatt correct this problem and gets the best Kendall's tau 1.0 for the under-represented words (where the Chein-Stein approximation is predicted to be the best).
(click on a plot to enlarge it in a new window)
|
|
|
|
(click on a plot to enlarge it in a new window)
For G-SPatt, reliability falls dramatically for over-represented words with a Kendall's tau of 0.55. The particular shape of the plot is due the low number of occurrences: all under-represented words occur N=0 time, and we can see clusters of words occurring (from left to right) N=1 time, N=2 times, N=3 times, and so on.
For LD-SPatt, their is a cluster of under-represented words for which statistics are very good. These words are simply the ones occurring N=0 times and in this particular case, a special simplification in the computation is used.
S-SPatt performs again very well, even if self-overlapping word statistics are still a problem.
CP-SPatt gives the best statistics both in terms of relative error and Kendall's tau.
(click on a plot to enlarge it in a new window)
|
|
|
|
(click on a plot to enlarge it in a new window)
As exact computation can not be done on such a large sequence (4.2 Mb), we take here CP-SPatt as our new reference.
G-SPatt is not very good but still acceptable. LD-SPatt and S-SPatt seem to perform better.
(click on a plot to enlarge it in a new window)
|
|
|
|
(click on a plot to enlarge it in a new window)
As in the last case, CP-SPatt is taken as our reference.
G-SPatt shows here an obvious lack of reliability while S-SPatt and LD-SPatt are still very good (LD-SPatt slightly better).
In this section we compare SPatt to several other softwares.
We compare here the results obtained with four softwares: rmes.gaussien 2.1.2p3 (shuffle Gaussian approximations), rmes.poisson.composee 2.1.2p3 (compound Poisson approximations), ldspatt 0.9.2 (large deviations approximations) and sspatt 0.9 (binomial approximations).
| method | time | memory |
|---|---|---|
| exact | O(k^{2m})+O(r^2n^2/k^h) | O(k^{2m})+O(n) |
| compound Poisson (*) | O(n^2/k^{2h}) | O(k^{m+1})+O(h+n/k^h) |
| Gaussian | O(k^{m+1})+O(hr^2) | O(k^{m+1}) |
| large deviations | O(k^h) | O(k^h) |
| binomial | O(k^{m+1})+O(hr) | O(k^{m+1}) |
These complexities are given for the computation of a pattern of size r (number of words in the pattern), of maximum length h (number of characters in each word) on a sequence of length n using an order m Markov model on a size k alphabet.
(*) note that compound Poisson approximation implemented in RMES does not provide statistics for patterns other than simple words (which means r is always equal to 1 for such approximations).
The minimum memory requirement for such computations is O(k^{m+1}) for the Markov model parameters. Gaussian and binomial approximation succeed to stick to that minimum complexity. Sequence length is involved in memory complexity in the case of exact computations as well as with the compound Poisson approximations, but in this last case, it is in fact the observed number of occurrence which is involved (approximated here by n/k^h). At last, in the case of large deviations, memory grows exponentially with the length h of the pattern which is a severe limitation of the method.
In terms of time complexity, binomial methods are clearly the fastest but Gaussian ones are quite similar. Compound Poisson approximations are also fast when used on rare patterns. Large deviations could be very slow with long patterns and the exact computation time is highly dependent on the sequence length.
As a conclusion, let say that both binomial and Gaussian approximations are suitable for all kind of parameters. Compound Poisson approximations should not be used in the case of high number of pattern occurrences. Large deviations cannot be used for long pattern. Exact computations are limited to short sequences.
We give here the empirical running time for different software computing statistics for all words of length h on a standard DNA alphabet using an order m Markov model with Mycoplasma genitalium (580 Kb).
| software | time (in seconds) | ||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rmes.gaussien |
|
||||||||||||||||||||||||||||||||||||||||||
| rmes.poisson.composee |
|
||||||||||||||||||||||||||||||||||||||||||
| ldspatt |
|
||||||||||||||||||||||||||||||||||||||||||
| sspatt |
|
All measurements have been done on a P4 2.8 Gz processor with 512 Mo RAM running linux 2.4.24.
As it was theoretically expected, sspatt is the fastest program. rmes.gaussien is close but lack of efficiency dealing with high order Markov model. rmes.poisson.composee works quite well with words of high length (as they are rare enough) but time grows dramatically when word length goes down. ldspatt is not at all in the same time range than the other programs except for very short words.
Absolute values
Relative errors
Ranks
Two sequences have been used: a short one (phage-lambda) and a long one (ecoli).
Two length of words have been used: short one (h=4) and a long one (h=7)
In all cases we use an order 1 markov model estimated on the observed sequence (in the case of rmes.gaussian, the model is the shuffle one but it is assumed here to be close enough to the Markovian one to allow proper comparisons).
phage-lambda and h=4
phage-lambda and h=7
ecoli and h=4
ecoli and h=7
| rmesg vs rmescp | rmesg vs ldspatt |
|
|
| rmesg vs sspatt | rmescp vs ldspatt |
|
|
| rmescp vs sspatt | ldspatt vs sspatt |
|
|
Comparisons between rmescp and all other programs show a group of words for which rmescp statistics are quite different than all other ones. As this cannot be explained by theory, we conclude here than rmescp have an implementation-related problem resulting sometimes in false statistics.
rmesg results are slightly different than all other approach when focusing on the extremes. This is not a surprise as Gaussian approximations are not expected to be very efficient for rare events.
ldspatt, sspatt and rmescp (for non error statistics) are quite similar even if we can see a small bias for large deviations (which are expected to get better on low p-values). We can also see some words with different statistics with rmescp and sspatt, nevertheless, as these differences are not really confirmed in ldspatt/sspatt comparison it is likely that sspatt should be more trusted than rmescp in this case.
| rmesg vs rmescp | rmesg vs ldspatt |
|
|
| rmesg vs sspatt | rmescp vs ldspatt |
|
|
| rmescp vs sspatt | ldspatt vs sspatt |
|
|
Using longer words help us to remove rmescp errors. As we got also higher statistics, rmesg bias in the extreme gets more sensitive. ldspatt still seems to lack of accuracy even if ranks are quite well preserved in comparison with rmescp or sspatt. At last, some differences can be seen between rmescp and sspatt but it is hard to conclude that one is better than another.
| rmesg vs rmescp | rmesg vs ldspatt |
|
|
| rmesg vs sspatt | rmescp vs ldspatt |
|
|
| rmescp vs sspatt | ldspatt vs sspatt |
|
|
Here rmescp has a lot of problems. First, errors seen with phage-lambda and h=4 get worse and secondly, rmescp seems incapable to deal with p-value lower than 1e-300 (which correspond to statistics smaller than -300 or higher than +300).
Without considering these false statistics, rmescp, ldspatt and sspatt seems to give very similar statistics. The ones from rmesg are quite different from the three others but this point, as explained before, is theoretically expected.
| rmesg vs rmescp | rmesg vs ldspatt |
|
|
| rmesg vs sspatt | rmescp vs ldspatt |
|
|
| rmescp vs sspatt | ldspatt vs sspatt |
|
|
This last case shows the same kind of behaviour than the last one.
not yet available
not yet available