
This vignette is a benchmark of a few key functions in the tidypopgen package. The Human Genome Diversity Project (HGDP) SNP dataset is used for this benchmark. This dataset includes 1043 individuals from 51 populations, typed at ~650k loci.

We will run this benchmark on a machine with a 13th Gen Intel(R) Core(TM) i9-13900 (24 CPUs) and 126 GiB of RAM, although we will limit the number of cores used to 20. Parallelised functions within tidypopgen take an n_cores argument that lets the user set the number of cores. However, to prevent a behind-the-scenes inflation of the number of threads used (for example, where dependency functions may automatically use all available cores), we need to set up some preferences at the beginning of the session. Specifically, we limit the number of cores used by the parallelised BLAS library with bigparallelr::set_blas_ncores(), and by the package data.table with data.table::setDTthreads().

n_cores <- 20

data.table::setDTthreads(n_cores)
bigparallelr::set_blas_ncores(n_cores)

We can now load the necessary libraries:
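A minimal sketch, assuming the analysis needs tidypopgen itself plus dplyr and ggplot2 for the data-manipulation verbs and plotting helpers used below:

library(tidypopgen)
library(dplyr)    # assumed: provides mutate(), group_by(), arrange() used below
library(ggplot2)  # assumed: provides aes(), theme(), scale_fill_viridis_c() used below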

Before running this benchmark, we need to download the bed file hgdp650.qc.hg19.bed and the metadata file HGDPid_populations.txt into the subdirectory data/hgdp/.

bed_path <- "./data/hgdp/hgdp650.qc.hg19.bed"
meta_info <- readr::read_tsv("./data/hgdp/HGDPid_populations.txt")
#> Rows: 1064 Columns: 6
#> ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (6): Id, Sex, population, Geographic_origin, Region, Pop7Groups
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create gen_tibble object

Our first step is to load the HGDP data into a gen_tibble object, and add its associated metadata.

hgdp <- gen_tibble(bed_path,
  quiet = TRUE,
  backingfile = tempfile("test_"),
  n_cores = n_cores
)

read_plink: 2.7s

Add metadata

hgdp <- hgdp %>% mutate(
  population = meta_info$population[match(hgdp$id, meta_info$Id)],
  region = meta_info$Region[match(hgdp$id, meta_info$Id)]
)
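
As a quick sanity check (an addition, not part of the original workflow), we can confirm that every individual found a match in the metadata:

# TRUE here would mean some ids were not matched in meta_info
any(is.na(hgdp$population))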

Let’s confirm that we have read all the expected information:

hgdp
#> # A gen_tibble: 643733 loci
#> # A tibble:     1,043 × 4
#>    id         genotypes population region
#>    <chr>     <vctr_SNP> <chr>      <chr> 
#>  1 HGDP00448          1 Biaka_HG   Africa
#>  2 HGDP00479          2 Biaka_HG   Africa
#>  3 HGDP00985          3 Biaka_HG   Africa
#>  4 HGDP01094          4 Biaka_HG   Africa
#>  5 HGDP00982          5 Mbuti_HG   Africa
#>  6 HGDP00911          6 Mandenka   Africa
#>  7 HGDP01202          7 Mandenka   Africa
#>  8 HGDP00927          8 Yoruba     Africa
#>  9 HGDP00461          9 Biaka_HG   Africa
#> 10 HGDP00451         10 Biaka_HG   Africa
#> # ℹ 1,033 more rows

Loci Report

We can then call qc_report_loci. This function supplies minor allele frequency, rate of missingness, and a Hardy-Weinberg exact p-value for each SNP.

loci_report <- qc_report_loci(hgdp)
#> This gen_tibble is not grouped. For Hardy-Weinberg equilibrium, `qc_report_loci()` will assume individuals are part of the same population and HWE test p-values will be calculated across all individuals. If you wish to calculate HWE p-values within populations or groups, please use`group_by()` before calling `qc_report_loci()`.

loci_report: 1.8s

The resulting report can be visualised using autoplot.

autoplot(loci_report, type = "all")
plot of chunk autoplot_loci_report
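
Beyond the plots, the per-locus statistics can also be inspected directly. A minimal sketch, relying only on the maf and missingness columns that are used for filtering below:

# Numeric summaries of the two statistics used for filtering
summary(loci_report$maf)
summary(loci_report$missingness)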

Filter Loci

Following this, we filter the loci to include only those with a minor allele frequency above 0.05 and a missingness rate below 0.05.

to_keep_loci <-
  subset(loci_report, loci_report$maf > 0.05 & loci_report$missingness < 0.05)
hgdp <- hgdp %>% select_loci(to_keep_loci$snp_id)

filter_loci: 1.4s
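
As an additional check (not part of the timed benchmark), we can count how many loci passed the filters; the report behaves as a data frame, so nrow() applies:

nrow(to_keep_loci)                      # loci retained
nrow(loci_report) - nrow(to_keep_loci)  # loci removed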

Individual Report

We can then call qc_report_indiv, which reports the observed heterozygosity and the rate of missingness for each individual.

indiv_report <- qc_report_indiv(hgdp)

indiv_report: 2.5s

autoplot(indiv_report, type = "scatter")
plot of chunk autoplot_indiv_report
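
The distribution of per-individual missingness can also be summarised numerically before choosing a threshold; a minimal sketch using the missingness column referenced in the next step:

summary(indiv_report$missingness)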

Filter individuals

We can then filter the individuals to keep only those with less than 1% of their genotypes missing.

to_keep_indiv <- which(indiv_report$missingness < 0.01)
hgdp <- hgdp[to_keep_indiv, ]

filter_indiv: 3ms
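
Again, as a quick check (not timed in the benchmark), we can see how many individuals were retained:

length(to_keep_indiv)  # individuals kept
nrow(hgdp)             # should match, since hgdp was subset to to_keep_indiv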

Update backingfile

After removing individuals from the dataset, and before imputing, we need to update the file-backed matrix with gt_update_backingfile.

hgdp <- gt_update_backingfile(hgdp)
#> Genetic distances are not sorted, setting them to zero
#> 
#> gen_backing files updated, now
#> using bigSNP file: /tmp/RtmpCu0Grc/test_47fa53a8faf7d_v2.rds
#> with backing file: /tmp/RtmpCu0Grc/test_47fa53a8faf7d_v2.bk
#> make sure that you do NOT delete those files!

Impute data

Some functions, such as loci_ld_clump and the gt_pca functions, require that there is no missingness in the dataset, so we use gt_impute_simple to impute any remaining missing genotypes.

hgdp <- gt_impute_simple(hgdp, method = "mode", n_cores = n_cores)
gt_set_imputed(hgdp, TRUE)

impute: 153ms

LD clumping

LD clumping is then performed to thin the dataset of loci in strong linkage disequilibrium (here, using an r2 threshold of 0.2).

hgdp <- hgdp %>%
  select_loci_if(loci_ld_clump(genotypes, thr_r2 = 0.2, n_cores = n_cores))

ld_clumping: 3.9s
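
To see how much the dataset was thinned, we can count the loci retained after clumping; a sketch assuming the show_loci() accessor, which returns the loci table of a gen_tibble:

# Number of loci left after LD clumping
nrow(show_loci(hgdp))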

PCA

A principal component analysis can then be computed on the resulting cleaned and LD-clumped dataset.

test_pca <- hgdp %>% gt_pca_partialSVD()

pca: 1.8s
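
Before plotting the scores, it can be useful to check how much variation the leading components capture; a sketch assuming that autoplot() for gt_pca objects also accepts type = "screeplot" (an assumption here, not shown in the original benchmark):

# Hypothetical screeplot of the PCA eigenvalues
autoplot(test_pca, type = "screeplot")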

Plot PCA:

autoplot(test_pca, type = "scores") +
  aes(color = hgdp$region, shape = hgdp$region)
plot of chunk PCA

DAPC

We can continue with a discriminant analysis of principal components with gt_dapc, using as groups the 6 main geographic regions covered by the dataset.

test_dapc <- gt_dapc(test_pca, pop = as.factor(hgdp$region))

dapc: 37ms

Plot DAPC:

autoplot(test_dapc, type = "scores")
plot of chunk DAPC_scores_plot

Calculate Fst

To examine the differentiation between populations in the global HGDP set, we calculate pairwise Fst.

grouped_hgdp <- hgdp %>% group_by(population)
pairwise_fsts <- grouped_hgdp %>% pairwise_pop_fst(
  n_cores = n_cores,
  type = "pairwise"
)

pairwise_fst: 3s
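
Before plotting, we can take a quick look at the spread of values (coercing to a plain matrix, since the subsetting below treats pairwise_fsts as matrix-like):

# Range of pairwise Fst values across all population pairs
range(as.matrix(pairwise_fsts), na.rm = TRUE)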

Plot pairwise Fst:

# Order by continents
grouped_hgdp_order <- grouped_hgdp %>% arrange(region, population)
regional_order <- unique(grouped_hgdp_order$population)
pairwise_fsts <- pairwise_fsts[regional_order, regional_order]

ggheatmap(pairwise_fsts) +
  scale_fill_viridis_c() +
  theme(
    axis.text.x = element_text(angle = -60, hjust = 0, size = 6),
    axis.text.y = element_text(angle = 0, hjust = 1, size = 6)
  ) +
  labs(
    x = element_blank(),
    y = element_blank(),
    fill = "Fst"
  )

Finally, we can save the resulting cleaned dataset to a PLINK .bed file.

gt_as_plink(hgdp,
  file = tempfile(),
  type = "bed",
  overwrite = TRUE
)
#> [1] "/tmp/RtmpCu0Grc/file47fa51ed66b65.bed"

plink_save: 250ms
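
As an optional final check (hypothetical, not part of the timed benchmark), the path returned by gt_as_plink() can be captured and read back with gen_tibble() to confirm the round-trip:

bed_out <- gt_as_plink(hgdp, file = tempfile(), type = "bed", overwrite = TRUE)
hgdp_check <- gen_tibble(bed_out, quiet = TRUE, backingfile = tempfile("check_"))
nrow(hgdp_check)  # should equal nrow(hgdp)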

Summary

Here is a summary of the time taken (in seconds) to perform each step of the analyses:

#>            step  time
#> 1    read_plink  2.73
#> 2   loci_report   1.8
#> 3   filter_loci  1.35
#> 4  indiv_report  2.52
#> 5  filter_indiv     0
#> 6        impute  0.15
#> 7   ld_clumping  3.87
#> 8           pca  1.76
#> 9          dapc  0.04
#> 10 pairwise_fst  2.99
#> 11   plink_save  0.25
#> 12        Total 17.46

Running the benchmark on a laptop

We will now run this benchmark on a laptop with an Intel(R) Core(TM) Ultra 7 155H (22 CPUs) and 31 GiB of RAM, limiting the number of cores to 4. Here is a summary of the time taken (in seconds) to perform each step of the analyses:

#>            step  time
#> 1    read_plink  5.31
#> 2   loci_report  3.46
#> 3   filter_loci  4.39
#> 4  indiv_report   3.3
#> 5  filter_indiv  0.01
#> 6        impute  0.96
#> 7   ld_clumping 10.09
#> 8           pca  3.94
#> 9          dapc  0.25
#> 10 pairwise_fst  5.35
#> 11   plink_save  0.47
#> 12        Total 37.53