This function performs Principal Component Analysis on a gen_tibble, using a fast truncated SVD with initial pruning followed by iterative removal of long-range LD regions. It is a wrapper for bigsnpr::snp_autoSVD().

Usage

gt_pca_autoSVD(
  x,
  k = 10,
  fun_scaling = bigsnpr::snp_scaleBinom(),
  thr_r2 = 0.2,
  use_positions = TRUE,
  size = 100/thr_r2,
  roll_size = 50,
  int_min_size = 20,
  alpha_tukey = 0.05,
  min_mac = 10,
  max_iter = 5,
  n_cores = 1,
  verbose = TRUE
)

Arguments

x

a gen_tbl object

k

Number of singular vectors/values to compute. Default is 10. This algorithm should be used to compute a few singular vectors/values.

fun_scaling

Usually this can be left unset, as it defaults to bigsnpr::snp_scaleBinom(), which is the appropriate function for biallelic SNPs. Alternatively, it is possible to use a custom function (see bigsnpr::snp_autoSVD() for details).

thr_r2

Threshold over the squared correlation between two SNPs. Default is 0.2. Use NA if you want to skip the clumping step.

use_positions

A logical indicating whether size is defined using SNP positions (TRUE) or as a number of SNPs (FALSE). Default is TRUE.

size

For one SNP, window size around this SNP to compute correlations. Default is 100 / thr_r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If use_positions is FALSE, this is a window in number of SNPs; otherwise it is a window in kb. Using positions is recommended when they are available.
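
The default window size can be checked with a little arithmetic. The sketch below simply evaluates 100 / thr_r2 for the three thresholds mentioned above; default_size is a hypothetical helper name used for illustration and is not part of tidypopgen or bigsnpr.

```r
# Hypothetical helper: reproduces the documented default, size = 100 / thr_r2.
default_size <- function(thr_r2) {
  stopifnot(is.numeric(thr_r2), all(thr_r2 > 0), all(thr_r2 <= 1))
  100 / thr_r2
}

thresholds <- c(0.2, 0.1, 0.5)
sizes <- default_size(thresholds)

# One row per threshold: a stricter (smaller) thr_r2 gives a wider window.
print(data.frame(thr_r2 = thresholds, size = sizes))
```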

roll_size

Radius of rolling windows to smooth log-p-values. Default is 50.

int_min_size

Minimum number of consecutive outlier SNPs in order to be reported as long-range LD region. Default is 20.

alpha_tukey

Default is 0.05. The type-I error rate in outlier detection (that is further corrected for multiple testing).

min_mac

Minimum minor allele count (MAC) for variants to be included. Default is 10.
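
As an illustration of what the MAC threshold measures, the sketch below counts minor alleles in a small made-up genotype matrix coded as 0/1/2 copies of the alternate allele. The matrix and the name minor_allele_count are hypothetical, and this is not necessarily how the filter is implemented internally.

```r
# Hypothetical helper: minor allele count per variant for diploid genotypes
# coded as 0, 1 or 2 copies of the alternate allele (no missing values).
minor_allele_count <- function(genotypes) {
  alt_count <- colSums(genotypes)
  total_alleles <- 2 * nrow(genotypes)
  pmin(alt_count, total_alleles - alt_count)
}

# Toy data: 4 individuals (rows) by 3 variants (columns), filled by column.
genotypes <- matrix(
  c(0, 1, 0, 0,   # variant 1: 1 alternate allele out of 8
    2, 2, 1, 2,   # variant 2: 7 alternate alleles out of 8
    1, 1, 2, 0),  # variant 3: 4 alternate alleles out of 8
  nrow = 4
)

mac <- minor_allele_count(genotypes)
keep <- mac >= 2  # with a toy threshold of 2, only variant 3 is retained
```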

max_iter

Maximum number of iterations of outlier detection. Default is 5.

n_cores

Number of cores used. Default doesn't use parallelism. You may use bigstatsr::nb_cores().

verbose

Output some information on the iterations? Default is TRUE.

Value

a gt_pca object, which is a subclass of big_SVD; this is a named S3 list with elements:

  • d, the singular values,

  • u, the scores for each sample on each component (the left singular vectors)

  • v, the loadings (the right singular vectors)

  • center, the centering vector,

  • scale, the scaling vector,

  • method, a string defining the method (in this case 'autoSVD'),

  • call, the call that generated the object.

Note: rather than accessing these elements directly, it is better to use tidy and augment. See gt_pca_tidiers.

Note: If you encounter 'Error in rollmean(): Parameter 'size' is too large.', roll_size exceeds the number of variants on at least one of your chromosomes. Try specifying a smaller roll_size to avoid this error.
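
When the rollmean() error described above appears, it can help to check how many variants each chromosome contributes. The sketch below uses a made-up vector of chromosome labels and a hypothetical helper, max_roll_size. It assumes that the smoothing window spans 2 * roll_size + 1 variants (roll_size being described as a radius), which is an assumption about the internals rather than documented behaviour. Because clumping and MAC filtering remove variants before smoothing, the counts that matter may be lower than those in the raw data.

```r
# Hypothetical helper: largest radius whose full window (2 * radius + 1)
# still fits on the chromosome with the fewest variants.
max_roll_size <- function(chromosome) {
  variants_per_chr <- table(chromosome)
  smallest <- min(variants_per_chr)
  floor((smallest - 1) / 2)
}

# Toy data: chromosome "1" has 301 variants, chromosome "2" only 41.
chromosome <- c(rep("1", 301), rep("2", 41))

# Under the stated assumption, the default roll_size = 50 would not fit here.
max_roll_size(chromosome)
```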

Details

Using gt_pca_autoSVD requires a reasonably large dataset, as the function iteratively removes regions of long-range LD.
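
A minimal sketch of a call, using only arguments shown in Usage. It assumes that tidypopgen and bigstatsr are installed and that gt is a gen_tibble you have already prepared (here assumed to contain no missing genotypes); run_autosvd_pca is a hypothetical wrapper name, not part of the package.

```r
# Hypothetical wrapper around gt_pca_autoSVD(); `gt` is assumed to be a
# gen_tibble large enough for long-range LD regions to be detected.
run_autosvd_pca <- function(gt, k = 10, roll_size = 50) {
  tidypopgen::gt_pca_autoSVD(
    gt,
    k = k,                            # number of components to compute
    roll_size = roll_size,            # reduce if rollmean() reports 'size' too large
    n_cores = bigstatsr::nb_cores(),  # parallelise, as suggested for n_cores
    verbose = FALSE                   # silence per-iteration messages
  )
}

# Usage (not run here, as it needs a real dataset):
# pca <- run_autosvd_pca(gt, k = 5)
```

Exposing only k and roll_size keeps the sketch short; the remaining arguments are left at the defaults listed in Usage.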