What's new in SPatt 2.x

Introduction

The new SPatt 2.x branch uses DFAs (deterministic finite automata) both to count occurrences and to perform statistical computations. The technique relies on PMCs (Pattern Markov Chains; see the reference for more detail) and makes it possible to handle highly degenerate patterns such as gapped ones (e.g. atgtg.(12-15).tggat) or even Prosite ones.

Right now, this branch is still under heavy development and not all features of the 1.x branch are implemented yet. However, SPatt 2.x is already fully functional for exact computations (including the repartition distribution, a new feature) and Gaussian approximations.

Please note that there is only one program, called "spatt", in the 2.x branch. The different statistical methods are now available through command-line options rather than separate programs (e.g. "spatt --gaussian" rather than "gspatt").

Some examples

Let us consider the pattern "aba.(0-3)baa" over the binary alphabet {a,b}.

We can build the DFA associated with this pattern and the corresponding PMC (in the M00 model) with the following command:

$ spatt -a ab -p "aba.(0-3)baa" -m -1 --dfa dfa.dot
(Note that adding the "-r" option to this command line makes the program study renewal occurrences of the pattern rather than overlapping ones.)
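To make the difference between overlapping and renewal occurrences concrete, here is a minimal Python sketch (it is not SPatt code). It assumes that an occurrence is recorded at the position where it ends and that, in renewal mode, a new occurrence may only begin after the previous one has ended. The regex translation of the pattern and the helper name occurrence_ends are illustrative choices, not part of SPatt.

```python
import re

# Assumed regex equivalent of the SPatt pattern "aba.(0-3)baa".
PATTERN = re.compile(r"aba.{0,3}baa")
MIN_LEN, MAX_LEN = 6, 9  # shortest and longest possible occurrence


def occurrence_ends(text, renewal=False):
    """Return the 1-based end positions of the occurrences found in `text`.

    With renewal=True, an occurrence is kept only if it begins after the
    end of the previously kept occurrence (no overlap allowed).
    """
    ends = []
    start = 0  # occurrences may not begin before this index
    for end in range(1, len(text) + 1):
        for k in range(MIN_LEN, MAX_LEN + 1):
            if end - k >= start and PATTERN.fullmatch(text, end - k, end):
                ends.append(end)
                if renewal:
                    start = end  # restart the search after this occurrence
                break
    return ends


print(occurrence_ends("ababaabaa"))                # [6, 9]
print(occurrence_ends("ababaabaa", renewal=True))  # [6]
```

In "ababaabaa", the occurrence "abaabaa" ending at position 9 overlaps the occurrence "ababaa" ending at position 6, so it is counted in overlapping mode and discarded in renewal mode.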

We can then visualize the DFA using the dot program from the Graphviz project:

$ dot -Grankdir=LR -Nfontsize=40 -Efontsize=40 -Tps dfa.dot -o dfa.eps
which gives the following figure:
[Figure: DFA of aba.(0-3)baa]
The file pmc.sci contains the Scilab definition of the PMC.
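The idea of running a Markov chain on top of the automaton can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SPatt's implementation: the M00 model is taken to mean independent, uniformly distributed letters, the sequence length is taken to be 1000, and the automaton state is simply the last 8 letters read (a valid but non-minimal DFA, unlike the one SPatt builds). The function names are hypothetical.

```python
import re
from functools import lru_cache

ALPHABET = "ab"
PATTERN = re.compile(r"aba.{0,3}baa")  # assumed regex form of aba.(0-3)baa
MIN_LEN, MAX_LEN = 6, 9                # shortest and longest occurrence


@lru_cache(maxsize=None)
def ends_with_occurrence(window):
    """True if some suffix of `window` is an occurrence of the pattern."""
    return any(
        PATTERN.fullmatch(window[-k:])
        for k in range(MIN_LEN, min(MAX_LEN, len(window)) + 1)
    )


def expected_count(n):
    """Expected number of overlapping occurrences (counted at their end
    position) in a uniform i.i.d. text of length n."""
    dist = {"": 1.0}           # probability of each automaton state
    mean = 0.0
    p = 1.0 / len(ALPHABET)    # uniform letter probability (assumption)
    for _ in range(n):
        nxt = {}
        for state, prob in dist.items():
            for c in ALPHABET:
                window = state + c
                if ends_with_occurrence(window):
                    mean += prob * p  # an occurrence ends on this letter
                new_state = window[-(MAX_LEN - 1):]  # keep the last 8 letters
                nxt[new_state] = nxt.get(new_state, 0.0) + prob * p
        dist = nxt
    return mean


print(expected_count(1000))  # 52.40234375
```

The result can be compared with the mean printed by the "--gaussian" run in this section. Tracking the number of occurrences alongside the state would, in the same way, give the full distribution of the count rather than its mean only.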

It is then possible to study the distribution of this pattern with several methods:

$ spatt -a ab -p "aba.(0-3)baa" -m -1 -S ab1000.fasta
distribution:
P(N=0)=1.451793e-24
P(N=1)=8.311068e-23
[output truncated]
P(N=39)=8.389673e-03
P(N>=40)=9.733052e-01
pattern=aba.(0-3)baa    Nobs=39 P(N<=Nobs)=2.669477e-02
gives the exact distribution of the number of occurrences of the pattern.
$ spatt -a ab -p "aba.(0-3)baa" -m -1 -S ab1000.fasta --repartition
16      6.137695e-01    1
51      1.918259e-01    1
66      3.837891e-01    1
101     1.918259e-01    1
116     3.837891e-01    1
[output truncated]
951     1.918259e-01    1
966     3.837891e-01    1
1000    1.918259e-01    0
gives the occurrence positions and associates a waiting-time p-value with each observation.
$ spatt -a ab -p "aba.(0-3)baa" -m -1 -S ab1000.fasta --gaussian
pattern=aba.(0-3)baa    Nobs=39 mean=52.402344  sd=6.872593     z-score=-1.950115       P(N<=Nobs)=2.558123e-02
performs a Gaussian approximation.
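The z-score and p-value on this output line can be checked by hand from the reported mean and standard deviation. The short Python sketch below does so, assuming a plain normal approximation without continuity correction; the function name gaussian_pvalues is illustrative and does not come from SPatt.

```python
from math import erfc, sqrt


def gaussian_pvalues(n_obs, mean, sd):
    """Return (z-score, P(N<=n_obs), P(N>=n_obs)) under a normal
    approximation with the given mean and standard deviation."""
    z = (n_obs - mean) / sd
    lower = 0.5 * erfc(-z / sqrt(2))  # normal CDF at z
    upper = 0.5 * erfc(z / sqrt(2))   # normal survival function at z
    return z, lower, upper


# Values copied from the spatt output line above.
z, lower, upper = gaussian_pvalues(39, 52.402344, 6.872593)
print(f"z-score={z:.6f} P(N<=Nobs)={lower:.6e}")
```

Using erfc rather than 1 - CDF keeps the upper tail accurate for large positive z-scores, which matters when the reported quantity is a very small P(N>=Nobs).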

Please note that it is possible to use a Markov model of order m>=0 but, unlike in the SPatt 1.x branch, the parameters must be provided through the "-M" option. If you want to use parameters estimated from a sequence, the simplest way is to use the SPatt 1.x branch to perform the estimation.

Here is an example: the DNA pattern "g.tggtgg.(0-20)g.tggtgg" on the complete genome of Escherichia coli K12:

$ sspatt U00096.fna -m 3 -M tmp.markov
$ spatt -a acgt -p "g.tggtgg.(0-20)g.tggtgg" -m 3 -M tmp.markov -S U00096.fna --gaussian --over
pattern=g.tggtgg.(0-20)g.tggtgg Nobs=14 mean=2.173635   sd=1.510378     z-score=7.830070        P(N>=Nobs)=2.437999e-15