|
Similarity search methods are important for data mining
in many fields, such as biology, finance, and artificial
intelligence. In molecular biology, for example, computer-assisted
sequence comparison is widely recognized as an essential
analytical tool. The quality of an alignment is usually
characterized by an alignment score. Statistical significance
(e.g., a p-or E- value) needs to be assigned to the
alignment score in order to interpret the result. This
statistics assessment is however a difficult and time-consuming
task for the existing Smith-Waterman-type or HMM-based
algorithms, whenever gaps are involved.
A novel hybrid algorithm has recently been devised
so that the alignments it generates obey universal
statistics, without sacrificing the sensitivity
and computational cost compared to the best of the existing
methods. The alignment scores of the hybrid algorithm
have been shown to satisfy the so-called Gumbel distribution,
with the few Gumbel parameters readily computable, for
a large class of scoring functions and for a wide range
of sequence lengths and amino acid frequencies. These
allow one to rapidly characterize the results of sequence
comparison, without the need for extensive simulation
for each scoring functions as it is currently done.
The hybrid algorithm has been successfully tested on
known classes of protein sequences. For more information,
see: Y.-K. Yu and T. Hwa, Statistical Significance of
Probabilistic Sequence Alignment and Related Local Hidden
Markov Models, J. Comp. Biol., submitted for
publication.
Advantages:
- Provides universal statistics for a large class
of similarity detection algorithms.
- Score statistics obey Gumbel distribution with lambda=1
for a wide range of gap costs and substitution matrices.
- No need for time-consuming simulations in order
to assign statistical significance.
- 1000-fold savings in computer time for statistical
evaluation.
- Same sensitivity as best existing methods.
- Similar computational complexity as for existing
methods.
- Easily implemented without sacrificing computational
efficiency.
- Can easily incorporate BLAST heuristics for large
database searches.
- Applicable to position-specific scoring system.
- Extends to many similarity search problems, such
as in finance (e.g., correlations among financial
time series) and artificial intelligence (e.g., enhanced
recognition of speech patterns).
CASE NUMBER: SD2001-042
INQUIRES TO: invent@ucsd.edu
|