NANUQ\(^+\) is a collection of routines for inference of level-1 network features under the network multispecies coalescent (NMSC) model. Notably, they do not assume that the entire network is level-1. Taking as input a collection of unrooted topological gene trees, output is an unrooted (semidirected) topological level-1 network that captures species relationships, displaying cycles and cut edges (tree-like network features). Complex blobs (those that are not cycles) may be left as multifurcations of degree greater than 3 (polytomies) in this structure.
The initial step infers the unknown network’s tree of blobs, using
the function TINNIK
(described in another vignette). This
tree partially represents species relationships, showing only cut edges
and contracting all reticulation cycles to multifurcations.
Multifurcations suspected to arise from cycles can then be resolved into
optimal cycle structures using resolveCycle
. This
resolution relies on a least-squares approach, comparing an empirical
NANUQ distance among taxon groups around a multifurcation to expected
distances for possible cycle configurations. For cycles under a default
size of 10, a full search is fast, while larger cycles are processed
using a quartet-based heuristic.
The user may combine cycle resolutions for some or all of the
individual multifurcations with combineCycle
. Blobs that
are not suspected to arise from cycles can be left unresolved. When all
multifurcations are believed to be cycle-derived,
resolveLevel1
both resolves all multifurcations and
combines the resolutions into fully-resolved level-1 networks.
We recommend using the NANUQ
function before resolving
any multifurcations, to assess which ones may have resulted from network
cycles. For a level-1 network, NANUQ
generates a splits
graph with distinctive structural features called ‘darts’. If the
level-1 assumption is violated, this structural pattern is disrupted,
providing insight into whether resolving blobs as cycles is
warranted.
For blobs that are cycles, all procedures here provide statistically consistent estimates of network structure under the NMSC model.
The NANUQ\(^+\) routines take as input a collection of gene trees for a set of taxa. From multilocus sequence data, standard phylogenetic tools must first be used to infer unrooted topological trees for each gene. Although these trees are not raw data, they are treated as such. If the gene trees are rooted or contain edge lengths, that information is simply ignored in these routines.
We work with an example data set of 16,338 gene trees of 16 Leopardus species, supplied by Lescroart et al. (2023) and analyzed with NANUQ\(^+\) by Allman, Baños, Rhodes, et al. (2024). Gene trees in Newick format can be read from a text file and their displayed quartet trees counted with commands such as:
# read text file of gene trees and count quartets on them
gts<-read.tree(file = 'genetreefile')
tableLeopardusLescroart=quartetTable(gts)
We will use a table of this data supplied within
MSCquartets, pre-computed by the above commands. Both
TINNIK
and NANUQ
can take as input either the
gene tree file or the quartet table.
# load data file containing quartet counts for Leopardus data set supplied with MSCquartets package
# These counts are will be accessed as `tableLeopardusLescroart`.
data(tableLeopardusLescroart)
All functions used here allow gene trees with missing taxa, provided each subset of four taxa appears on at least one tree (ideally, on many). Statistical tests are conducted for each of these subsets, with the sample size being the number of trees displaying the subset.
The good statistical behavior of the analysis below is established under the assumption that the input gene trees are a true sample from the NMSC model. In practice, inferred gene trees, containing some error, are used instead. Poorly inferred gene trees or those lacking sufficient resolution can compromise results.
The function TINNIK
can be used to infer the tree of
blobs for an unknown network. This first applies two hypothesis tests
for each choice of 4 taxa, testing whether the gene tree counts allow
for the rejection of a resolved quartet tree or a star tree for those
species. (See TINNIK
’s vignette or (Allman, Baños, Mitchell, et al. 2024) for more
information.)
# perform TINNIK analysis for gene trees, using defaults
output<-TINNIK(tableLeopardusLescroart)
#> Applying hypothesis test for model T3 to 1820 quartets.
#> Applying hypothesis test for star tree model to 1820 quartets.
# save table of quartet information with p-values
pT<-output$pTable
The default TINNIK
test levels are \(\alpha=.05\) for the null hypothesis “the
quartet has a tree-like relationship”, and \(\beta=.95\) for the null hypothesis “the
quartet has a star-like relationship”. An initial run of
TINNIK
with default test levels is rarely a sufficient
analysis. One should always vary the \(\alpha\) and \(\beta\) to judge robustness of the inferred
tree of blobs to their choice.
As detailed by Allman, Baños, Rhodes, et al. (2024), for this data we choose \(\alpha=5e-29\), imposing a stringent test for judging non-tree-likeness. Note in the first plot below this gives a clear separation between red (quartets interpreted as 4-blobs) and blue (quartets interpreted as tree-like) symbols, allowing for noise in the gene trees without rejecting tree-like-ness too strongly.
For this dataset and test levels we obtained a tree of blobs with two multifurcations, both of degree five.
# perform improved TINNIK analysis to infer the tree of blobs
output<-TINNIK(pT, alpha=5e-29,beta = 0.95)
Output produced by TINNIK
includes the table of quartet
counts and associated test p-values and the tree of blobs. We rename
these to use as input in further functions.
NANUQ
to explore whether multifurcations should
be resolved into cyclesThe NANUQ
function can help detect whether a blob might
be a cycle. On it’s own, NANUQ
is a statistically
consistent level-1 network inference algorithm under the NMSC model. It
takes the same input as TINNIK (a set of gene trees or a quartet table,
along with two test levels) and produces, among other outputs, a
distance measure that can be used as input to the
neighborNet
algorithm (as implemented in the
phangorn
package) to generate a splits network with
specific structural properties. When the model is violated, this
structure is disrupted, providing insight into potential violations of
the level-1 assumption.
Further details on NANUQ and its applications are given by Allman, Baños, and Rhodes (2019) and Rhodes et al. (2020). We recommend using the same test levels but encourage testing alternative levels as well.
# perform NANUQ analysis for table of quartet information and p-values
D<-NANUQ(pT, alpha = 5e-29,beta = 0.95) # Run the NANUQ routine
#> Distance table written to file: NANUQdist_alpha5e-29_beta0.95.nex
NN<-neighborNet(D$dist) # Run the NeighborNet algorithm on the NANUQ distance
plot(NN) # plot the splits-graph with neighborNet
As in (Allman, Baños, Rhodes, et al. 2024), we identified one multifurcation that aligns well with the model hypothesis, displaying a characteristic dart-like structure, and another that does not. We cannot determine whether the latter multifurcation results from a model violation (presumably level-1-ness) or simply noise. Nevertheless, this indicates caution is needed in interpreting findings associated with this multifurcation.
We used the NeighborNet implementation neighborNet
in
the phangorn
package here, to do the full analysis within
R. However, we recommend using SplitsTree (Huson
and Bryant 2006) for the splits graph, as it has been more
thoroughly tested. For more details, see (Rhodes
et al. 2020).
When resolving the multifurcations, the first step is to numerically label all internal nodes in the tree of blobs, so that we and the code can refer to individual multifurcations.
Here the multifurcations are nodes 18 and 20. We will resolve each
one independently using the function resolveCycle
. While
the same test levels as in TINNIK can be used, we suggest experimenting
with different levels.
The output resC18
is a list that includes a network
displaying the best resolution of node 18, with the other multifurcation
remaining unresolved. Other information in this list is needed for
combining resolutions of different multifurcations, or producing the
histogram of residuals for all possible resolutions.
# resolve node 2O
resC20<-resolveCycle(ToB,20,pT,alpha=5e-29,beta=0.95)
#> For node 20, 5 resolutions found.
Note multifurcation 20 produced five optimal resolutions (that is, those 5 resolutions have residual sums of squares that are within a small tolerance). This suggests that this multifurcation may not arise from a cycle, or that the data has too much noise to be sure.
Alternatively, if one suspects that all multifurcations arise from a cycle, one could attempt to resolve all multifurcations to obtain a level-1 network.
#Fully resolve the tree of blobs to a level-1 network
resN<-resolveLevel1(ToB=output$ToB, pTable=output$pTable, alpha=5e-29, beta=0.95,
distance="NANUQ")
#> 2 multifurcations on tree, at nodes 18, 20.
#> For node 20, 5 resolutions found.
The output resN
includes, as its first element, a list
of the 5 level-1 networks that accommodate all the optimal resolutions
that can co-occur.
In reporting any NANUQ\(^+\) analysis, it is essential to report the test levels used, and, as mentioned throughout, we strongly recommend exploring with values beyond the defaults.
NANUQ\(^+\) yielded 5 fully-resolved topological networks for this data analysis.
One can use the Julia package PhyloNetworks (Solı́s-Lemus, Bastide, and Ané 2017) to optimize numerical parameters (edge lengths and hybridization parameters) on these under a quartet pseudo-likelihood criterion, and also use the pseudo-likelihood score to further compare these networks. Note, however, that comparing networks by psuedo-likelihood is a compromise due to the difficulty of computing full likelihoods.
The commands below (based on 2024 PhyloNetworks) provide an example of this. Note that this is Julia code and not R code, and it requires additional installation of software.
using PhyloNetworks #Load package
gts = readMultiTopology("genetreefile") #load gene trees to Julia
q, t = countquartetsintrees(gts; showprogressbar = true) #compute the quartet counts
global df = writeTableCF(q, t) # to get a DataFrame that can be saved to a file later
global empCFs = readTableCF(df)
ntwk = readMultiTopology("Network.nwk", false) # read the topology of the newtwok to be optimized
net = topologyMaxQPseudolik!(ntwk,empCFs) # optimize network parameters to obtain the pseudo-likelihood score
Inferring numerical parameters is only suggested for networks that
are fully resolved, such as those from the function
resolveLevel1
. Attempted optimization on networks that are
partially resolved, where some multifurcations represent unknown blob
structures, may lead to erroneous inference.