fastmixture algorithm for population genetics clustering

This function implements the fastmixture algorithm for population genetics clustering by calling the Python module. If you use this function, please cite the relevant paper by Santander, Refoyo-Martínez, and Meisner (2024).

gt_fastmixture(
  x,
  k,
  n_runs = 1,
  threads = 1,
  seed = 42,
  iter = 1000,
  tole = 1e-09,
  batches = 32,
  supervised = NULL,
  check = 5,
  power = 11,
  chunk = 8192,
  subsample = 0.7,
  min_subsample = 50000,
  max_subsample = 5e+05,
  als_iter = 1000,
  als_tole = 1e-04,
  no_freqs = TRUE,
  random_init = TRUE,
  safety = TRUE
)

Arguments

x: either a tidypopgen::gen_tibble, or the name of the binary plink file (without the .bed extension)
k: the number of ancestral components (clusters), either a single value or a vector
n_runs: the number of repeats for each k value
threads: the number of threads to use (1)
seed: the random seed (defaults to 42); it should be a vector of length n_runs
iter: the maximum number of iterations (1000)
tole: the tolerance in log-likelihood units between iterations (1e-9)
batches: the maximum number of mini-batches (32)
supervised: the name of the file with the supervised labels (NULL)
check: the number of iterations to check for convergence (5)
power: number of power iterations in randomised SVD (11)
chunk: the number of SNPs in chunk operations (8192)
subsample: the fraction of SNPs to subsample in SVD/ALS (0.7)
min_subsample: the minimum number of SNPs to subsample in SVD/ALS (50000)
max_subsample: the maximum number of SNPs to subsample in SVD/ALS (500000)
als_iter: the maximum number of iterations in the ALS algorithm (1000)
als_tole: the tolerance for the RMSE of P between iterations (1e-4)
no_freqs: do not save P-matrix (TRUE)
random_init: random initialisation of parameters (TRUE)
safety: add extra safety steps in unstable optimizations (TRUE)

Value

an object of class gt_admix. See tidypopgen::gt_admixture() for details.

Details

This function returns a q_matrix that can be plotted with autoplot, and tidied with tidy methods from the tidypopgen package.
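
Examples

The call below is a minimal sketch of how the arguments described above fit together, not a tested recipe. It assumes a PLINK fileset with the hypothetical prefix my_data (my_data.bed, my_data.bim, my_data.fam) in the working directory, and that the fastmixture Python module is already available to R. The seed vector has one entry per repeat, matching n_runs. The call is guarded so that nothing runs if tidypopgen or the hypothetical files are missing.

```r
# Hypothetical PLINK fileset prefix: my_data.bed, my_data.bim and my_data.fam
plink_prefix <- "my_data"

k_values <- 2:4       # values of K to fit
n_runs <- 3           # repeats for each value of K
seeds <- c(1, 2, 3)   # one seed per repeat, so length(seeds) == n_runs

# Only run when tidypopgen is installed and the hypothetical files exist
if (requireNamespace("tidypopgen", quietly = TRUE) &&
    file.exists(paste0(plink_prefix, ".bed"))) {
  fm <- tidypopgen::gt_fastmixture(
    x = plink_prefix,
    k = k_values,
    n_runs = n_runs,
    seed = seeds,
    threads = 2
  )
  class(fm)  # the Value section describes this as a gt_admix object
}
```

All other arguments are left at the defaults listed in the usage block. Passing a gen_tibble as x instead of a file prefix should work the same way, according to the description of x.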

References

C. G. Santander, A. Refoyo Martinez, J. Meisner (2024) Faster model-based estimation of ancestry proportions. bioRxiv 2024.07.08.602454; doi: https://doi.org/10.1101/2024.07.08.602454