Technology Description: Similarity search methods are important for data mining in many fields, such as biology, finance, and artificial intelligence. In molecular biology, for example, computer-assisted sequence comparison is widely recognized as an essential analytical tool. The quality of an alignment is usually characterized by an alignment score. Statistical significance (e.g., a p-or E- value) needs to be assigned to the alignment score in order to interpret the result. This statistics assessment is however a difficult and time-consuming task for the existing Smith-Waterman-type or HMM-based algorithms, whenever gaps are involved.
A novel hybrid algorithm has recently been devised so that the alignments it generates obey universal statistics, without sacrificing the sensitivity and computational cost compared to the best of the existing methods. The alignment scores of the hybrid algorithm have been shown to satisfy the so-called Gumbel distribution, with the few Gumbel parameters readily computable, for a large class of scoring functions and for a wide range of sequence lengths and amino acid frequencies. These allow one to rapidly characterize the results of sequence comparison, without the need for extensive simulation for each scoring functions as it is currently done.
The hybrid algorithm has been successfully tested on known classes of protein sequences. For more information, see: Y.-K. Yu and T. Hwa, Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models, J. Comp. Biol., submitted for publication.
Advantages:
Case Number: SD2001-042
Inquiries To : invent@ucsd.edu