Skip to contents

This vignette is a benchmark of a few key functions in the tidypopgen package. The Human Genome Diversity Project (HGDP) SNP dataset is used for this benchmark. This dataset includes 1043 individuals from 51 populations, typed at ~650k loci.

We will run this benchmark on a machine with a GenuineIntel, Intel(R) Core(TM) Ultra 7 155H, 22 CPU and 31 GiB of RAM. However, we will limit the number of cores used to 20. Parallelised functions within tidypopgen use an n_cores argument for the user to set. However, to prevent a behind-the-scenes inflation of the number of threads used (for example, in cases where dependency functions may automatically use all available cores) we need to set up preferences at the beginning of the session. Specifically, we limit the number of cores used by the parallelised BLAS library with bigparallelr::set_blas_ncores(), and by the package data.table with data.table::setDTthreads().

n_cores <- 20

data.table::setDTthreads(n_cores)
bigparallelr::set_blas_ncores(n_cores)

We can now load the necessary libraries:

library(tidypopgen)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: tibble
library(ggplot2)

And before running the benchmark, we need to download the relevant files. We can use the following to download the HGDOP files into a temporary directory:

temp_dir <- tempdir()
download_url <- "https://zenodo.org/records/15582364/files/hgdp650_id_pop.txt"
download_path <- file.path(temp_dir, "hgdp650_id_pop.txt")
download.file(download_url, download_path, mode = "wb")
download_url <- "https://zenodo.org/records/15582364/files/hgdp650.qc.hg19.bed"
download_path <- file.path(temp_dir, "hgdp650.qc.hg19.bed")
download.file(download_url, download_path, method = "wget")
download_url <- "https://zenodo.org/records/15582364/files/hgdp650.qc.hg19.bim"
download_path <- file.path(temp_dir, "hgdp650.qc.hg19.bim")
download.file(download_url, download_path, method = "wget")
download_url <- "https://zenodo.org/records/15582364/files/hgdp650.qc.hg19.fam"
download_path <- file.path(temp_dir, "hgdp650.qc.hg19.fam")
download.file(download_url, download_path, method = "wget")

In the following vignette, we place these files into the subdirectory data/hgdp/, and we will use the paths to the bed file hgdp650.qc.hg19.bed and the metadata file hgdp650_id_pop.txt to read in our data.

bed_path <- "./data/hgdp/hgdp650.qc.hg19.bed"
meta_info <- readr::read_tsv("./data/hgdp/hgdp650_id_pop.txt")
#> Error: './data/hgdp/hgdp650_id_pop.txt' does not exist in current working directory ('/home/EJC199/Git/tidypopgen/vignettes/articles').

Create gen_tibble object

Our first step is to load the HGDP data into a gen_tibble object, and add its associated metadata.

hgdp <- gen_tibble(bed_path,
  quiet = TRUE,
  backingfile = tempfile("test_"),
  n_cores = n_cores
)

read_plink: 5.3s

Add metadata

hgdp <- hgdp %>% mutate(
  population = meta_info$population[match(hgdp$id, meta_info$Id)],
  region = meta_info$Region[match(hgdp$id, meta_info$Id)]
)
#> Error in `mutate()`:
#> ℹ In argument: `population = meta_info$population[match(hgdp$id, meta_info$Id)]`.
#> Caused by error:
#> ! object 'meta_info' not found

Let’s confirm that we have read all the expected information:

hgdp
#> # A gen_tibble: 643733 loci
#> # A tibble:     1,043 × 2
#>    id         genotypes
#>    <chr>     <vctr_SNP>
#>  1 HGDP00448          1
#>  2 HGDP00479          2
#>  3 HGDP00985          3
#>  4 HGDP01094          4
#>  5 HGDP00982          5
#>  6 HGDP00911          6
#>  7 HGDP01202          7
#>  8 HGDP00927          8
#>  9 HGDP00461          9
#> 10 HGDP00451         10
#> # ℹ 1,033 more rows

Loci Report

We can then call qc_report_loci. This function supplies minor allele frequency, rate of missingness, and a Hardy-Weinberg exact p-value for each SNP.

loci_report <- qc_report_loci(hgdp)
#> This gen_tibble is not grouped. For Hardy-Weinberg equilibrium, `qc_report_loci()` will assume individuals are part of the same population and HWE test p-values will be calculated across all individuals. If you wish to calculate HWE p-values within populations or groups, please use`group_by()` before calling `qc_report_loci()`.

loci_report: 3.3s The resulting report can be observed using autoplot.

autoplot(loci_report, type = "all")
plot of chunk autoplot_loci_report
plot of chunk autoplot_loci_report

Filter Loci

Following this, we filter the loci to only including those with a minor allele frequency over 0.05, and a missingness rate below 0.05.

to_keep_loci <-
  subset(loci_report, loci_report$maf > 0.05 & loci_report$missingness < 0.05)
hgdp <- hgdp %>% select_loci(to_keep_loci$snp_id)

filter_loci: 5.3s

Individual Report

We can then call qc_report_indiv to supply observed heterozygosity per individual, and rate of missingness per individual.

indiv_report <- qc_report_indiv(hgdp)

indiv_report: 3.4s

autoplot(indiv_report, type = "scatter")
plot of chunk autoplot_indiv_report
plot of chunk autoplot_indiv_report

Filter individuals

And we can filter individuals down to only include those with less than 1% of their genotypes missing.

to_keep_indiv <- which(indiv_report$missingness < 0.01)
hgdp <- hgdp[to_keep_indiv, ]

filter_indiv: 7ms

Update backingfile

After removing individuals from the dataset, and before imputing, we need to update the file backing matrix with gt_update_backingfile.

hgdp <- gt_update_backingfile(hgdp)
#> Genetic distances are not sorted, setting them to zero
#> 
#> gen_backing files updated, now
#> using bigSNP file: /tmp/RtmpY30jCS/test_163872732cbd_v2.rds
#> with backing file: /tmp/RtmpY30jCS/test_163872732cbd_v2.bk
#> make sure that you do NOT delete those files!

Impute data

Some functions, such as loci_ld_clump and the gt_pca functions, require that there is no missingness in the dataset, so we use gt_impute_simple to impute any remaining missing genotypes.

hgdp <- gt_impute_simple(hgdp, method = "mode", n_cores = n_cores)
gt_set_imputed(hgdp, TRUE)

impute: 365ms

LD clumping

LD clumping is then performed to control for linkage disequilibrium.

hgdp <- hgdp %>%
  select_loci_if(loci_ld_clump(genotypes, thr_r2 = 0.2, n_cores = n_cores))

ld_clumping: 7.5s

PCA

A principal components analysis can then be computed using the resulting cleaned and LD clumped dataset.

test_pca <- hgdp %>% gt_pca_partialSVD()

pca: 4s

Plot PCA:

autoplot(test_pca, type = "scores") +
  aes(color = hgdp$region, shape = hgdp$region)
#> Warning: Unknown or uninitialised column: `region`.
#> Unknown or uninitialised column: `region`.
plot of chunk PCA
plot of chunk PCA

DAPC

We can continue with a discriminant analysis of principal components using gt_dapc, setting 6 groups corresponding to the main geographic regions covered by the dataset.

test_dapc <- gt_dapc(test_pca, pop = as.factor(hgdp$region))
#> Warning: Unknown or uninitialised column: `region`.
#> Error in sweep(x$u, 2, x$d, "*")[, 1:n_pca, drop = FALSE]: only 0's may be mixed with negative subscripts

dapc: 24ms

Plot DAPC:

autoplot(test_dapc, type = "scores")
#> Error: object 'test_dapc' not found

Calculate Fst

To examine the differentiation between populations in the global HGDP set, we calculate pairwise Fst.

grouped_hgdp <- hgdp %>% group_by(population)
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `population` is not found.
pairwise_fsts <- grouped_hgdp %>% pairwise_pop_fst(
  n_cores = n_cores,
  type = "pairwise"
)
#> Error: object 'grouped_hgdp' not found

pairwise_fst: 129ms

Plot pairwise Fst:

# Order by continents
grouped_hgdp_order <- grouped_hgdp %>% arrange(region, population)
regional_order <- unique(grouped_hgdp_order$population)
pairwise_fsts <- pairwise_fsts[regional_order, regional_order]

ggheatmap(pairwise_fsts) +
  scale_fill_viridis_c() +
  theme(
    axis.text.x = element_text(angle = -60, hjust = 0, size = 6),
    axis.text.y = element_text(angle = 0, hjust = 1, size = 6)
  ) +
  labs(
    x = element_blank(),
    y = element_blank(),
    fill = "Fst"
  )

Finally, we can save the resulting cleaned dataset to a PLINK .bed file.

gt_as_plink(hgdp,
  file = tempfile(),
  type = "bed",
  overwrite = TRUE
)
#> [1] "/tmp/RtmpY30jCS/file1638575879aa.bed"

plink_save: 485ms

Summary

Here is a summary of the time taken (in seconds) to perform each step of the analyses:

#>            step  time
#> 1    read_plink  5.29
#> 2   loci_report  3.29
#> 3   filter_loci  5.29
#> 4  indiv_report  3.44
#> 5  filter_indiv  0.01
#> 6        impute  0.36
#> 7   ld_clumping  7.54
#> 8           pca  3.99
#> 9          dapc  0.02
#> 10 pairwise_fst  0.13
#> 11   plink_save  0.48
#> 12        Total 29.84