©1996-2007 All Rights Reserved. Online Journal of Bioinformatics. You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be reproduced or re-transmitted without the express permission of the editors. This journal satisfies the refereeing requirements (DEST) for the Higher Education Research Data Collection (Australia). Linking: To link to this page or any pages linking to this page you must link directly to this page only here rather than put up your own page.
OJB™ Online Journal of Bioinformatics © 8 (1): 30-40, 2007
PIDA: A new algorithm for pattern identification
Putonti C (1,2), Pettitt BM (1,2,3), Reid JG (3), Fofanov Y (1,2)
(1) Department of Computer Science, (2) Department of Biology and Biochemistry, and (3) Department of Chemistry, University of Houston, Houston, Texas, USA
ABSTRACT
Putonti C, Pettitt BM, Reid JG, Fofanov Y, PIDA: A new algorithm for pattern identification, Online Journal of Bioinformatics, 8 (1): 30-40, 2007.
Algorithms for motif identification in sequence space have predominantly focused on recognizing patterns of a fixed length that contain regions of perfect conservation, possibly separated by regions of unconstrained sequence. Such motifs can be found in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements that are necessary to maintain functionality. If an insertion or deletion occurs within an unconstrained portion of the pattern, the pattern may still retain its functionality.
In such a case the length of the pattern is variable, and the pattern may be overlooked by existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here has been developed to recognize patterns whose occurrences vary in length, within sequences over an alphabet of any size. PIDA first identifies all regions of perfect conservation longer than a user-specified threshold and then builds those conservation “islands” into fixed-length patterns. Next, the algorithm modifies these fixed-length patterns by identifying additional (and different) islands that can be incorporated into each pattern through insertions/deletions within the “water” separating the islands. To provide some benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as within sequences known to contain conserved patterns. For each pattern found, the statistical significance is calculated from the pattern’s likelihood of appearing by chance, providing a means to determine which patterns are likely to have a functional role.
The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length, although it can also identify patterns of a fixed length. PIDA has been designed to be as generally applicable as possible, since a variety of sequence problems are of this type, from transcription factor binding sites in DNA, to structural motifs in non-coding RNA, to high-contact-order domains in certain proteins. The algorithm was implemented in C++ and is freely available upon request from the authors.
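The island-and-water idea described above can be illustrated with a short sketch. It is not the authors' C++ implementation, and it makes several simplifying assumptions: islands are approximated by exact substrings of one fixed length `k` (rather than all conserved regions longer than a threshold), only pairs of islands are linked, and the function names `find_islands` and `link_islands` and the parameters `max_water` and `min_support` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def find_islands(sequences, k, min_support=2):
    """Map each length-k substring to {sequence index: [start positions]},
    keeping only substrings found in at least min_support sequences.
    Assumption: a fixed seed length k stands in for the length threshold."""
    hits = defaultdict(lambda: defaultdict(list))
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            hits[seq[start:start + k]][idx].append(start)
    return {isl: dict(pos) for isl, pos in hits.items()
            if len(pos) >= min_support}


def link_islands(islands, k, max_water, min_support=2):
    """For each ordered island pair (left, right), collect the water lengths
    seen between them in every sequence that contains both islands.
    A pair is 'fixed' if one water length is shared by all supporting
    sequences, otherwise 'variable' (consistent with an insertion/deletion)."""
    patterns = {}
    for left, right in product(islands, repeat=2):
        gaps = {}
        for idx in islands[left].keys() & islands[right].keys():
            found = {r - (l + k)
                     for l in islands[left][idx]
                     for r in islands[right][idx]
                     if 0 < r - (l + k) <= max_water}
            if found:
                gaps[idx] = found
        if gaps and len(gaps) >= min_support:
            common = set.intersection(*gaps.values())
            patterns[(left, right)] = ("fixed" if common else "variable", gaps)
    return patterns


if __name__ == "__main__":
    # Toy data: two islands separated by 2, 3 and 2 symbols of water.
    seqs = ["ttACGTxxGGCAtt", "aACGTyyyGGCAa", "ACGTzzGGCAcc"]
    islands = find_islands(seqs, k=4)
    for pair, (kind, gaps) in link_islands(islands, k=4, max_water=5).items():
        print(pair, kind, gaps)
```

On the toy input, the only linked pair is `ACGT` followed by `GGCA`, reported as variable because the water between the islands is two symbols long in two sequences and three in the other. Because the sequences are treated as plain strings, the sketch works with an alphabet of any size.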
KEY WORDS: pattern discovery, motif conservation, variable length patterns
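The abstract states that each pattern's significance is based on its likelihood of appearing by chance but does not give the formula. As a rough illustration only, the sketch below uses a standard Poisson approximation under assumptions that are not taken from the paper: symbols are independent and uniformly distributed, `conserved` is the total number of perfectly conserved positions, `span` is the shortest occurrence length, and `spacings` counts the allowed water-length combinations. The function name `chance_probability` is hypothetical.

```python
import math


def chance_probability(conserved, alphabet_size, seq_len, span, spacings=1):
    """Approximate P(at least one chance occurrence) in one random sequence.
    Assumes i.i.d. uniform symbols; this is NOT the paper's statistic."""
    starts = max(seq_len - span + 1, 0)      # possible start positions
    # Expected number of chance matches over all starts and spacings.
    expected = starts * spacings * alphabet_size ** (-conserved)
    return -math.expm1(-expected)            # 1 - exp(-expected)


if __name__ == "__main__":
    # Two 4-symbol DNA islands with exactly 2 symbols of water (span 10).
    print(chance_probability(conserved=8, alphabet_size=4, seq_len=1000, span=10))
    # Same islands, but water of 2 or 3 symbols is accepted.
    print(chance_probability(conserved=8, alphabet_size=4, seq_len=1000, span=10,
                             spacings=2))
```

Under these assumptions, allowing more spacings raises the chance probability, so a variable-length pattern needs more conserved positions than a fixed-length one to reach the same significance.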