PCA controlling for LD for gen_tibble objects
gt_pca_autoSVD.Rd
This function performs Principal Component Analysis on a gen_tibble, using a
fast truncated SVD with initial pruning and then iterative removal of
long-range LD regions. This function is a wrapper for bigsnpr::snp_autoSVD().
Usage
gt_pca_autoSVD(
  x,
  k = 10,
  fun_scaling = bigsnpr::snp_scaleBinom(),
  thr_r2 = 0.2,
  use_positions = TRUE,
  size = 100/thr_r2,
  roll_size = 50,
  int_min_size = 20,
  alpha_tukey = 0.05,
  min_mac = 10,
  max_iter = 5,
  n_cores = 1,
  verbose = TRUE
)
Arguments
- x
a gen_tbl object.
- k
Number of singular vectors/values to compute. Default is 10. This algorithm should be used to compute a few singular vectors/values.
- fun_scaling
Usually this can be left unset, as it defaults to bigsnpr::snp_scaleBinom(), which is the appropriate function for biallelic SNPs. Alternatively, it is possible to use a custom function (see bigsnpr::snp_autoSVD() for details).
- thr_r2
Threshold over the squared correlation between two SNPs. Default is 0.2. Use NA if you want to skip the clumping step.
- use_positions
A boolean on whether the position is used to define size, or whether the size should be in number of SNPs. Default is TRUE.
- size
For one SNP, window size around this SNP to compute correlations. Default is 100 / thr_r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If positions are not used (use_positions = FALSE; infos.pos = NULL in bigsnpr::snp_autoSVD()), this is a window in number of SNPs; otherwise it is a window in kb. Providing positions is recommended when they are available.
- roll_size
Radius of rolling windows to smooth log-p-values. Default is 50.
- int_min_size
Minimum number of consecutive outlier SNPs in order to be reported as long-range LD region. Default is 20.
- alpha_tukey
The type-I error rate in outlier detection (that is further corrected for multiple testing). Default is 0.05.
- min_mac
Minimum minor allele count (MAC) for variants to be included. Default is 10.
- max_iter
Maximum number of iterations of outlier detection. Default is 5.
- n_cores
Number of cores used. Default doesn't use parallelism. You may use bigstatsr::nb_cores().
- verbose
Output some information on the iterations? Default is TRUE.
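To make the default for size concrete, the base R sketch below reproduces the 100 / thr_r2 arithmetic quoted in the size argument. The helper default_window_size is hypothetical and is not part of tidypopgen or bigsnpr; it only mirrors how the default is written in Usage.

```r
# Hypothetical helper (not part of tidypopgen or bigsnpr): it mirrors the
# default `size = 100 / thr_r2` shown in the Usage section.
default_window_size <- function(thr_r2 = 0.2, size = 100 / thr_r2) {
  # `size` is evaluated lazily, so it follows whatever `thr_r2` is supplied
  size
}

thresholds <- c(0.1, 0.2, 0.5)
window_sizes <- vapply(
  thresholds,
  function(thr) default_window_size(thr_r2 = thr),
  numeric(1)
)

# One row per threshold, with the window size the default would give
print(data.frame(thr_r2 = thresholds, size = window_sizes))
```

A stricter (lower) thr_r2 therefore widens the default window, unless size is set explicitly.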
Value
a gt_pca object, which is a subclass of big_SVD; this is a named S3 list with elements:
- d, the eigenvalues (singular values, i.e. as variances),
- u, the scores for each sample on each component (the left singular vectors),
- v, the loadings (the right singular vectors),
- center, the centering vector,
- scale, the scaling vector,
- method, a string defining the method (in this case 'autoSVD'),
- call, the call that generated the object.
Note: rather than accessing these elements directly, it is better to use
tidy and augment. See gt_pca_tidiers.
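As a rough illustration of how the elements d, u, v, center and scale relate to each other, the sketch below runs a plain base R svd() on simulated genotype dosages. It is not the implementation used by gt_pca_autoSVD() (there is no clumping or long-range LD removal here), and it assumes that the binomial scaling amounts to centering each SNP by 2p and dividing by sqrt(2p(1 - p)), where p is the allele frequency. All object names are illustrative.

```r
set.seed(1)
n_ind <- 20
n_snp <- 50

# Simulated genotype dosages (0, 1 or 2); illustrative data only
freqs <- runif(n_snp, 0.2, 0.8)
geno <- sapply(freqs, function(p) rbinom(n_ind, size = 2, prob = p))

# Drop any SNP without variation so that the scaling is well defined
geno <- geno[, apply(geno, 2, var) > 0, drop = FALSE]

# Assumed binomial scaling: center by 2p, scale by sqrt(2p(1 - p))
p <- colMeans(geno) / 2
snp_center <- 2 * p
snp_scale <- sqrt(2 * p * (1 - p))
scaled <- sweep(sweep(geno, 2, snp_center, "-"), 2, snp_scale, "/")

# Truncated decomposition keeping the first k components
k <- 3
dec <- svd(scaled, nu = k, nv = k)
pca <- list(
  d = dec$d[seq_len(k)],  # singular values
  u = dec$u,              # left singular vectors (one row per sample)
  v = dec$v,              # right singular vectors (one row per SNP)
  center = snp_center,
  scale = snp_scale
)

# Projecting the scaled genotypes onto v gives u scaled by d
scores <- pca$u %*% diag(pca$d)
```

In practice the tidiers mentioned above remain the recommended way to extract scores and loadings from a gt_pca object.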
Note: If you encounter 'Error in rollmean(): Parameter 'size' is too large.',
roll_size exceeds the number of variants on at least one of your chromosomes.
If you have set roll_size explicitly, you will need to reduce it.
If not, try specifying a roll_size smaller than the default to avoid this error.
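One way to choose a smaller roll_size is to count the variants on each chromosome and cap the radius accordingly. The base R sketch below uses a made-up vector of chromosome labels (one label per variant) and makes the conservative assumption that a full window of 2 * roll_size + 1 variants should fit on the smallest chromosome. Variant counts after the clumping step may be lower than the raw counts, so an even smaller value may be needed.

```r
# Hypothetical chromosome labels, one per variant (illustrative data only)
chromosome <- c(rep("chr1", 400), rep("chr2", 250), rep("chr3", 35))

# Number of variants on each chromosome, and the smallest of these counts
variants_per_chr <- table(chromosome)
smallest <- min(variants_per_chr)

# roll_size is a radius, so a full window spans 2 * roll_size + 1 variants.
# Conservative assumption: that window should fit on the smallest chromosome.
default_roll_size <- 50
max_radius <- (smallest - 1) %/% 2
roll_size <- min(default_roll_size, max_radius)

print(variants_per_chr)
print(roll_size)
```

The resulting value can then be passed as the roll_size argument.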