Command line examples

In each example, the command line follows the "$" prompt and the lines after it show the output.

A simple word

$ sspatt ecoli.fasta -p gctggtgg -m 1
gctggtgg        499     70.10   +240.760141
An order 1 Markov model is estimated on the sequence ecoli.fasta, and the statistic for the pattern gctggtgg is output in log scale (the default). We observe 499 occurrences where 70.10 are expected, so the pattern is over-represented with a p-value around 1e-240.
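
As a reading aid, the sketch below turns a log-scale statistic into a direction and a base-10 p-value exponent. It assumes, based only on the examples on this page, that the sign marks over- or under-representation and that the magnitude is -log10 of the p-value. The function name interpret_statistic is hypothetical and is not part of sspatt.

```python
def interpret_statistic(stat):
    """Return (direction, log10 of the p-value) for a signed log-scale statistic.

    Assumption (inferred from the examples, not from sspatt itself):
    the sign gives the direction and |stat| equals -log10(p-value).
    """
    if stat > 0:
        direction = "over-represented"
    elif stat < 0:
        direction = "under-represented"
    else:
        direction = "as expected"
    return direction, -abs(stat)


direction, log10_p = interpret_statistic(+240.760141)
print(f"{direction}, p-value about 10^{log10_p:.2f}")
# prints: over-represented, p-value about 10^-240.76
```

The exponent is kept rather than the p-value itself because values this small quickly approach the limits of floating-point representation.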

With a more complex pattern

$ sspatt ecoli.fasta -p g.tggtgg -m 0
g.tggtgg        1043    294.63  +249.394597
Here "." stands for any letter, so the pattern covers the words {gatggtgg,gctggtgg,ggtggtgg,gttggtgg}. Statistics are computed under an order 0 Markov model. We observe 1043 occurrences where only 294.63 are expected, so the pattern is over-represented with a p-value around 1e-249.
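
To make the set of matched words concrete, here is a minimal sketch that expands such a pattern by hand. It assumes that "." is a single-letter wildcard over the DNA alphabet acgt, which is what the example above suggests; any other pattern syntax sspatt may accept is not handled. The helper expand_pattern is hypothetical.

```python
from itertools import product


def expand_pattern(pattern, alphabet="acgt"):
    """List every word matched by a pattern where '.' means any letter.

    Assumption: '.' is the only special character in the pattern.
    """
    # For each position, the candidate letters: the whole alphabet for '.',
    # otherwise just the literal character.
    choices = [alphabet if ch == "." else ch for ch in pattern]
    return ["".join(word) for word in product(*choices)]


print(expand_pattern("g.tggtgg"))
# prints: ['gatggtgg', 'gctggtgg', 'ggtggtgg', 'gttggtgg']
```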

All words of a given length

$ sspatt ecoli.fasta -l 4 -m 2 --all-words
aaaa    35124   35104.19        +0.338697
aaac    25253   26618.98        -16.893008
aaag    22788   20425.36        +59.038597
aaat    25736   26752.44        -9.738282
aaca    21870   18864.77        +100.972531
aacc    20444   24098.25        -128.653351
aacg    24404   23571.88        +7.477807
(...)
Statistics for all words of length 4 are computed for an order 2 Markov model.
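
With many words in the output, it can help to rank them by the strength of the statistic. The sketch below parses the lines shown above and sorts them by absolute value. It assumes the four whitespace-separated columns are word, observed count, expected count and statistic, as in the examples on this page; parse_line is a hypothetical helper.

```python
def parse_line(line):
    """Split one result line into (word, observed, expected, statistic).

    Assumption: columns are whitespace-separated, in the order shown
    in the examples.
    """
    word, observed, expected, statistic = line.split()
    return word, int(observed), float(expected), float(statistic)


# The lines printed in the example above (the elided part is left out).
output = """\
aaaa    35124   35104.19        +0.338697
aaac    25253   26618.98        -16.893008
aaag    22788   20425.36        +59.038597
aaat    25736   26752.44        -9.738282
aaca    21870   18864.77        +100.972531
aacc    20444   24098.25        -128.653351
aacg    24404   23571.88        +7.477807
"""

rows = [parse_line(line) for line in output.splitlines() if line.strip()]

# Most exceptional words first, regardless of direction.
rows.sort(key=lambda row: abs(row[3]), reverse=True)

for word, observed, expected, statistic in rows[:3]:
    print(word, observed, expected, statistic)
```

Among the seven lines shown, aacc, aaca and aaag come out on top.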

A very long pattern

$ sspatt swissprot.fasta -l 3 -m 2 -a ARNDCEQGHILKMFPSTWYV \
-p PNEKVVGIYRMTTPSVLLRDLDIIKHVLIKDFESFADRGVEF
PNEKVVGIYRMTTPSVLLRDLDIIKHVLIKDFESFADRGVEF      1       1.998352e-45    +44.699328
Computes the statistic for the (very) long pattern specified over the amino acid alphabet. Because this alphabet has a high cardinality (20), the counted words must be shorter than the default length, which explains the -l 3 parameter. One occurrence is observed where 2e-45 are expected, so the pattern is over-represented with a p-value around 1e-44.
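
The need for a shorter word length can be illustrated by simple counting: an alphabet of k letters has k^l distinct words of length l. The sketch below compares DNA (k = 4) with amino acids (k = 20). It is only arithmetic and does not model how sspatt actually stores its counts; word_count is a hypothetical helper.

```python
def word_count(alphabet_size, length):
    """Number of distinct words of a given length over an alphabet."""
    return alphabet_size ** length


print("length  dna  protein")
for length in range(1, 6):
    print(length, word_count(4, length), word_count(20, length))
```

For length 3 this gives 64 DNA words against 8000 protein words, and the gap widens rapidly with each extra letter.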