Parallel Testing

library(segtest)

Introduction

We provide a walk-through of how to run tests for segregation distortion at many loci in parallel. Details of the methods may be found in Gerard et al. (2025a) and Gerard et al. (2025b).

Analysis

Data sets ufit, ufit2, and ufit3 contain the genotyping output of updog::multidog() using three different models.

You can convert this genotyping output to what seg_multi() expects using multidog_to_g().

If you did not use the f1pp or f1 models, use ether the all_gl (to run tests using genotype log-likelihoods) or all_g (to run tests assuming genotypes are known) options, and you must specify the ID’s of the parents.

o1 <- multidog_to_g(ufit, ploidy = 4, type = "all_g", p1 = "indigocrisp", p2 = "sweetcrisp")
o2 <- multidog_to_g(ufit, ploidy = 4, type = "all_gl", p1 = "indigocrisp", p2 = "sweetcrisp")

If you did use the f1pp or f1 models, use either the off_gl (to run tests using genotype log-likelihoods) or off_g (to run tests assuming genotypes are known) options.

o3 <- multidog_to_g(ufit2, ploidy = 4, type = "off_g")
o4 <- multidog_to_g(ufit2, ploidy = 4, type = "off_gl")
o5 <- multidog_to_g(ufit3, ploidy = 4, type = "off_g")
o6 <- multidog_to_g(ufit3, ploidy = 4, type = "off_gl")

We would recommend always using genotype log-likelihoods. But the option is there for known genotypes.

Parallelization support is provided through the future package. You specify how to implement parallelization using future::plan(). Then you run seg_multi(). Then you shut down the parallelization with future::plan(). The most common plan is future::multisession(), where you specify the number of parallel processes with the workers argument. You can get the maximum number of workers via future::availableCores()

future::availableCores()
#> system 
#>     16

So a typically workload looks something like this:

future::plan(future::multisession(workers = 2))
mout1 <- seg_multi(
  g = o2$g, 
  p1 = o2$p1,
  p2 = o2$p2, 
  p1_ploidy = 4, 
  p2_ploidy = 4,
  ret_out = TRUE)
future::plan(future::sequential())
#> Loading required package: foreach
#> Loading required package: rngtools

The output is a data frame. The most important parts of this are the snp and p_value columns.

mout1[c("snp", "p_value")]
#>            snp    p_value
#> 1   1_44673475 0.20525001
#> 2  11_28836119 0.20013872
#> 3  12_31029646 0.01800832
#> 4   12_8487773 0.99521289
#> 5    2_1426329 0.45701691
#> 6   2_20070837 0.13228993
#> 7   2_40108108 0.26596888
#> 8   4_37820371 0.21251272
#> 9   6_30619745 0.61700707
#> 10   6_4350249 0.64950122

It looks like only SNP 12_31029646 is in possible segregation distortion. Let’s compare the expected frequencies (from the q0 column) against the observed modes.

round(mout1$q0[mout1$snp == "12_31029646"][[1]], digits = 4)
#> [1] 0.2008 0.5985 0.2008 0.0000 0.0000
o1$g["12_31029646", ] / sum(o1$g["12_31029646", ])
#>         0         1         2         3         4 
#> 0.1541667 0.5833333 0.2625000 0.0000000 0.0000000

So SNP 12_31029646 likely got flagged because there are too many individuals with a genotype of 2, and not enough with a genotype of 0.

If we would have run outlier = FALSE, then SNP 2_20070837 would also have been flagged for possible segregation distortion

future::plan(future::multisession(workers = 2))
mout2 <- seg_multi(
  g = o2$g,
  p1 = o2$p1, 
  p2 = o2$p2, 
  p1_ploidy = 4, 
  p2_ploidy = 4,
  outlier = FALSE)
future::plan(future::sequential())
mout2[, c("snp", "p_value")]
#>            snp     p_value
#> 1   1_44673475 0.205250155
#> 2  11_28836119 0.200138927
#> 3  12_31029646 0.018008343
#> 4   12_8487773 0.995749454
#> 5    2_1426329 0.457017169
#> 6   2_20070837 0.001131654
#> 7   2_40108108 0.265969050
#> 8   4_37820371 0.212513126
#> 9   6_30619745 0.617007947
#> 10   6_4350249 0.649501842

Let’s look at the posterior mode genotypes of SNP 2_20070837:

o1$g["2_20070837", ]
#>   0   1   2   3   4 
#>  40 166  32   2   0
o1$p1["2_20070837"]
#> 2_20070837 
#>          0
o1$p2["2_20070837"]
#> 2_20070837 
#>          2

So SNP 2_20070837 likely got flagged because of two individuals that are very likely a genotype of 3, which is impossible if the parents have genotypes 0 and 2. Including outlier = TRUE previously made it so that we would ignore this discrepancy (up to a point). You can get the posterior probability that an individual is an outlier if you set ret_out = TRUE (which I previously did).

graphics::plot(
  mout1$outprob[mout1$snp == "2_20070837"][[1]], 
  ylim = c(0, 1),
  ylab = "Outlier Probability")

Scatterplot of SNP index (x-axis) by outlier probability (y-axis).

multi_lrt()

The older function multi_lrt(), only provides support for tetraploids. You can still use it instead of seg_multi() for tetraploids.

future::plan(future::multisession(workers = 2))
mout3 <- multi_lrt(
  g = o2$g,
  p1 = o2$p1, 
  p2 = o2$p2
)
future::plan(future::sequential())

It’s a little faster, but it cannot account for outliers, so it will mostly be the same as setting outlier = FALSE.

plot(
  mout2$p_value,
  mout3$p_value,
  xlab = "seg_multi(outlier = FALSE)", 
  ylab = "multi_lrt()")
abline(0, 1, lty = 2, col = 2)

Scatterplot of p-values from seg_lrt() (x-axis) by multi_lrt() (y-axis).

The one discrepancy above is caused by the new way we calculate the degrees of freedom:

i <- which.max(abs(mout2$p_value - mout3$p_value))
mout2$df[[i]]
#> [1] 2
mout3$df[[i]]
#> [1] 3

The two approaches should give almost equivalent results most of the time, but the new way is a little better.

References

Gerard D, Thakkar M, & Ferrão LFV (2025a). “Tests for segregation distortion in tetraploid F1 populations.” Theoretical and Applied Genetics, 138(30), p. 1–13. doi:10.1007/s00122-025-04816-z.

Gerard, D, Ambrosano, GB, Pereira, GdS, & Garcia, AAF (2025b). “Tests for segregation distortion in higher ploidy F1 populations.” bioRxiv, p. 1–20. bioRxiv:2025.06.23.661114

mirror server hosted at Truenetwork, Russian Federation.