©1996-2007
All Rights
Reserved. Online Journal of Bioinformatics .
You may not store these pages in any form
except for your own personal use. All other usage or distribution is
illegal
under international copyright treaties. Permission to use any of these
pages in
any other way besides the before
mentioned must
be gained in writing from the publisher. This article is exclusively
copyrighted in its entirety to OJB publications. This article may be
copied
once but may not be, reproduced or re-transmitted
without the express permission of the editors. This
journal satisfies the refereeing requirements (DEST) for the Higher
Education
Research Data Collection (Australia). Linking:To link to this page or any pages linking
to this
page you must link directly to this page only here rather than put up
your own
page.
OJBTM
Online Journal of Bioinformatics©
8 (2)
: 165-178, 2007
Constructing Phylogeny from Whole Genome Alignment
Chan
PY, Lam TW, Liu CM, Yiu
SM
Department
of Computer Science, The
ABSTRACT
Chan
PY,
Lam TW, Liu CM, Yiu SM, Constructing
Phylogeny from Whole Genome Alignment, Online
Journal of Bioinformatics, 8 (2) :
165-178, 2007. To
reconstruct a phylogeny for a given set of species, most of the
previous approaches
are based on the similarity information derived from a subset of
conserved regions
(or genes) in the corresponding genomes. In some cases, the regions
chosen may
not reflect the evolutionary history of the species and may be too
restricted
to differentiate the species. It is generally believed that the
inference could
be more accurate if whole genomes are being considered. The best
existing solution
that makes use of complete genomes was proposed by Henz
et al.[14] They can construct a phylogeny for 91 prokaryotic genomes in
170 CPU
hours with an accuracy of about 70% (based on the measurement of
non-trivial
splits) while other approaches that use whole genomes can only deal
with no
more than 20 species. Note that Henz et
al. measure
the distance between the species using BLASTN which is not primarily
designed
for whole genome alignment. Also, their approach is not scalable, for
example,
it probably takes over 1000 CPU hours to construct a phylogeny for all
230
prokaryotic genomes published by NCBI. In addition, we found that non-trivial splits is only a rough indicator of
the accuracy
of the phylogeny. In this paper, we propose the followings. (1) To
evaluate the
quality of a phylogeny with respect to a model answer, we suggest using
the
concept of the maximum agreement subtree
as it can
capture the structure of the phylogeny. (2) We propose to use whole
genome
alignment software (such as MUMmer) to
measure the
distances between the species and derive an efficient approach to
generate these
distances. From the experiments on real data sets, we found that our
approach
is more accurate and more scalable than Henz
et al.’s
approach. We can construct a phylogenetic tree for the same set of 91
genomes
with an accuracy more than 20%higher (with respect to both evaluation
measures)
in 2 CPU hours (more than 80 times faster than their approach). Also,
our
approach is scalable and can construct a phylogeny for 230 prokaryotic
genomes
with accuracy as high as 85% in only 9.5 CPU hours.
Keywords: whole
genome
alignment, phylogeny, maximum compatible subtree,
maximum agreement subtree.