References
Here is a list of references sorted by topic of interest:
Programming libraries
Large scale eigenproblems
Large Deviations
Gaussian approximations
Compound Poisson approximations
Exact statistics
Similar softwares
Other
Programming libraries (top)
- STL
the C++ Standard Template Library.
Used in SPatt for vector and map features.
- GSL
the C GNU Scientific Library.
Used in SPatt for 1D real minimization using Brent's algorithm.
- Arpack
Fortran 77 linear algebra library for Arnoldi class algorithms. Used
in SPatt for stationary distribution computations as well as
in the computation of large deviations statistics.
- Tclap
a command line parsing library written in C++.
- Lapack
The old fortran 77 library on linear algebra.
Used in SPatt for all eigenvalues related problems.
Large scale eigenproblems (top)
- Arnoldi, W. E.
The principle of minimised iterations in the solution of the matrix eigenvalue problem.
Quart. Appl. Math. (1951)
9, 17-29.
The original article from Arnoldi himself.
- Radke, R. J.
A Matlab Implementation of the Implicitly Restarted Arnoldi Method for Solving
Large-Scale Eigenvalue Problems.
Thesis of the Rice University, Houston, Texas (1996)
A clear presentation of Arnoldi class algorithms.
Large Deviations (top)
- Den Hollender, F.
Large Deviations.
American, Mathematical Society, Field Institute Monographs 14. (2000)
A very good reference on the large deviations.
- Nuel, G.
Grandes déviations et chaînes de Markov pour l'étude des
occurrences de mots dans les séquences biologiques.
Thesis of the Evry University, France (2001)
(pdf version available here)
This thesis presents the main results about large deviations statistics for patterns.
- Nuel, G.
LD-SPatt: Large Deviations Statistics for Patterns on Markov Chains.
J. C. B. (2004)
11 (6): 1023-1033.
An article presenting the results used in LD-SPatt implementation.
Gaussian approximations (top)
- Kleffe, J. and Borodovsky, M.
First and second moment of counts of words in random texts
generated by Markov chains.
Comput. Appl. Biosci. (1992)
8: 433-441.
Give exact expression of expectation and covariance of words counts in the markov
model of any order.
- Prum, B., Rodolphe, F. and Turckheim, E.
Finding words with unexpected frequencies in
desoxyribonucleic acid sequences.
J. Roy. Statist. Soc. Ser. B. (1995)
57: 205-220.
Give exact expression of expectation and covariance of words counts in the shuffle
model of any order.
Compound Poisson approximations (top)
- Chrysaphinou, O. and Papastavridis, S.
A limit theorem on the number of overlapping appearances of a
pattern in a sequence of independent trials.
Proba. Theory Relat. Fields (1988)
79 (1): 129-143.
First give the Polya-Aeppli (also called Geometric Poisson) approximation for
a pattern count on an order 0 Markov chain.
- Arratia, R.; Goldstein, L. and Gordon, L.
Poisson approximation and the Chen-Stein method.
Stat. Sci. (1990)
5 (4): 403-434.
Proposes a compound Poisson approximation for pattern count on a Markov chain.
- Schbath, S.
Compound Poisson approximation of word counts in DNA sequences.
ESAIM, Probab. Stat. (1995)
1: 1-16.
Proposes compound Poisson approximations for pattern count on a Markov chain.
- Robin, S.
A compound Poisson model for word occurrences in DNA sequences
J. Roy. Stat. Soc. Ser. (2002)
51: 437-451.
Shows that, as in the order 0 case, the compound Poisson approximation fall
in the particular case of the Polya-Aeppli distribution.
- Barbour, A. D.; Chen, L. and Loh, W.-L.
Compound Poisson approximation for nonnegative random variables
via Stein's method.
Ann. Proba. (1992)
20: 1843-1866.
Gives formulas for the general compound Poisson distribution and for the particular
case of the Polya-Aeppli distribution.
- Press, W. H.; Teukolsky, S. A.; Vetterling W. T. and Flannery, B. P.
Numerical Recipes in C
Cambridge University Press (1988)
Available here.
For simple Poisson and binomial distributions. Used in S-SPatt.
- Abramowitz and Stegun
Handbook of Mathematical Functions
Wiley-Interscience Publication (1972).
Available here.
For references on the Kummer distribution. Used in CP-SPatt.
- Nuel, G.
Cumulative Distribution Function of a Geometric
Poisson Distribution
Journal of statistical computation and simulation (2006), in press.
(preprint).
Presents the new linear algorithm (rather than quadratic) to compute
the cdf of a Geometric Poisson distribution (which is implemented in
CP-SPatt)
Exact statistics (top)
- Robin, S. and Daudin, J.-J.
Exact distribution of word occurrences in a random sequence
of letters.
J. Appl. Proba. (1999)
36: 179-193.
Method implemented in oldX-SPatt.
- Fu, J. C.
Distribution theory of runs and patterns associated with a
sequence of multi-state trials.
Statistica Sinica (1996)
6(4): 957-974.
- Nuel, G.
Effective p-value computations using Finite Markov Chain
Imbedding (FMCI): application to local score and to
pattern statistics.
Algo. Mol. Biol. (2006) 1(5)
(direct link in journal)
The new fast and reliable exact algorithms which are
implemented in X-SPatt.
Similar softwares (top)
- RMES
A software computing Gaussian approximations (in the shuffle case, not in the Markov one),
for any DNA pattern and compound Poisson approximations (in the Markov case) for
non degenerate DNA patterns. The software is written in C++ and is open source. Beware,
there is an open numerical issue with compound Poisson computations.
- REGEXPCOUNT which
is part of the ALGOLIB, a set of Maple packages. This software allow to compute exact
distribution using automatas, but is highly limited both in the Markov model order and in
the number of occurrences of the pattern. Especially efficient with high degenerate patterns
such as those of the Prosite database. Development of this software seems not very active.
- Quickscore is a C library
designed for pattern exact computations using generative series. Unfortunatly, this library
is not available anymore. A on-line version of it is still under development.
- RSAT
a complete set of online tool dedicated to the seach of regulatory sequence
analysis. A great reference.
Other (top)
- Nuel, G.
S-SPatt: Simple Statistics for Patterns on Markov chains.
Application note in Bioinformatics Volume 21, Number 13 pp. 3051-3052.
(
link in bioformatics).
A short article presenting the binomial approximations used in S-SPatt.
- Nuel, G.
Pattern statistics on Markov chains and
sensitivity to parameter estimation
Algorithms for Molecular Biology (2006) 1(17).
(direct link in journal)
This paper propose to study the consequence of parameter estimations on
pattern statistics (in the binomial case). The conclusion is that high
order Markov model (e. g. m>4 for DNA) should not be used.
- Nuel, G.
Numerical solutions for
Patterns Statistics on Markov chains
Journal of Statistical Applications in Genetic and Molecular Biology (2006) 5(1):26.
(direct link in journal)
An overview paper talking about almost all the statistics implemented
in SPatt and providing the most complete Benchmark available on the subject
at the present time.
- Nuel, G.
Distribution of patterns on Markov chains: a unified approach using
deterministic finite state automata
Journal of Applied Probability (2006), submitted.
(preprint)
This article presents the new DFA approach to the pattern problem
which is the heart of all methods implemented in the SPatt 2.x serie.
Thanks to this new approach it is now possible to consider overlap or
renewal occurrences of higly degenetaive patterns such as gapped ones
(ex: atgg.(12-16)ctggt) or PROSITE patterns.
- Nuel, G.
PMC pour l'étude des occurrences de motifs dans les séquences
markoviennes
HDR university of Evry
(pdf
slides)
This document proposes a synthesis of all my research work since my PhD.
Pattern statistics techniques using PMC are presented in detail (main document
in french but slides in english).