In this vignette, we illustrate how to troubleshoot tuning errors. This is not a comprehensive list (yet), but rather an attempt to illustrate how an error can be approached.
NAs in the data
Most algorithms do not allow NAs. We can generate a problematic dataset by loading the Lacerta dataset, and manually add an NA:
library(tidysdm)
#> Loading required package: tidymodels
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom 1.0.7 ✔ recipes 1.1.0
#> ✔ dials 1.3.0 ✔ rsample 1.2.1
#> ✔ dplyr 1.1.4 ✔ tibble 3.2.1
#> ✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
#> ✔ infer 1.0.7 ✔ tune 1.2.1
#> ✔ modeldata 1.4.0 ✔ workflows 1.1.4
#> ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
#> ✔ purrr 1.0.2 ✔ yardstick 1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
#> • Dig deeper into tidy modeling with R at https://www.tmwr.org
#> Loading required package: spatialsample
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
package = "tidysdm"
))
lacerta_thin$bio05[37] <- NA
Let us set up a recipe and fit workflow_set
lacerta_rec <- recipe(lacerta_thin, formula = class ~ .) %>%
step_rm(all_of(c(
"bio01", "bio02", "bio03", "bio04", "bio07", "bio08",
"bio09", "bio10", "bio11", "bio12", "bio14", "bio16",
"bio17", "bio18", "bio19", "altitude"
)))
lacerta_models <-
# create the workflow_set
workflow_set(
preproc = list(default = lacerta_rec),
models = list(
# the standard glm specs
glm = sdm_spec_glm(),
# rf specs with tuning
rf = sdm_spec_rf()
),
# make all combinations of preproc and models,
cross = TRUE
) %>%
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
set.seed(100)
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 5)
lacerta_models <-
lacerta_models %>%
workflow_map("tune_grid",
resamples = lacerta_cv, grid = 3,
metrics = sdm_metric_set(), verbose = TRUE
)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: default_glm
#> ✔ 1 of 2 resampling: default_glm (318ms)
#> i 2 of 2 tuning: default_rf
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> → A | error: Missing data in columns: bio05.
#> There were issues with some computations A: x1
#> There were issues with some computations A: x15
#>
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> Warning: Unknown or uninitialised column: `.notes`.
#> ✖ 2 of 2 tuning: default_rf failed with
We can see that the error is self-explanatory. Also, note that the
error impacts all data splits (technically, rset
objects):
error A is repeated 15 times (5 splits for 3 hyperparameter values).
Prepping the recipe (which trains it on the dataset) can help diagnosing problems:
lacerta_prep <- lacerta_rec %>% prep(lacerta_thin)
lacerta_prep
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 20
#> coords: 2
#>
#> ── Training information
#> Training data contained 452 data points and 1 incomplete row.
#>
#> ── Operations
#> • Variables removed: bio01, bio02, bio03, bio04, bio07, bio08, ... | Trained
Note that, in the training information, we were warned that 1 row was
incomplete. You could use step_naomit
to deal with this
programmatically, or ascertain why you are generating missing data (we
prefer the latter, as a good SDM pipeline should not generate
observations, presences or pseudoabsences, with missing data).
Recipes and the response variable
The response variable is treated in a special way in
recipes
, and this can lead to problems. It is best not to
manipulate (e.g. transform character into factor) the response variable
in a recipe, since that response variable will only be available when we
train and test models, but not when we make projections. If we
hard-coded a step in a recipe that includes the response variable, the
model will fit, but then it will fail when we start making
predictions.
Another potential mistake is to remove the response variable when
selecting variables of interest. This can happen if we use
step_select
to choose variables of interest, and the error
is less than clear:
Let’s load the data and create a recipe with
step_select
:
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
package = "tidysdm"
))
suggested_vars <- c("bio05", "bio06", "bio13", "bio14", "bio15")
lacerta_rec_sel <- recipe(lacerta_thin, formula = class ~ .) %>%
step_select(all_of(suggested_vars))
Now we create the workflow set and fit it:
lacerta_models <-
# create the workflow_set
workflow_set(
preproc = list(default = lacerta_rec_sel),
models = list(
# the standard glm specs
glm = sdm_spec_glm(),
# rf specs with tuning
rf = sdm_spec_rf()
),
# make all combinations of preproc and models,
cross = TRUE
) %>%
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
set.seed(100)
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 5)
lacerta_models <-
lacerta_models %>%
workflow_map("tune_grid",
resamples = lacerta_cv, grid = 3,
metrics = sdm_metric_set(), verbose = TRUE
)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: default_glm
#> → A | error: ! `logistic_reg()` was unable to find an outcome.
#> ℹ Ensure that you have specified an outcome column and that it hasn't been
#> removed in pre-processing.
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> Warning: Unknown or uninitialised column: `.notes`.
#> ✖ 1 of 2 resampling: default_glm failed with
#> i 2 of 2 tuning: default_rf
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> → A | error: ! `rand_forest()` was unable to find an outcome.
#> ℹ Ensure that you have specified an outcome column and that it hasn't been
#> removed in pre-processing.
#> There were issues with some computations A: x3
#> There were issues with some computations A: x15
#>
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more information.
#> Unknown or uninitialised column: `.notes`.
#> ✖ 2 of 2 tuning: default_rf failed with
The errors are not very intuitive. However, all models have failed for all algorithms, which suggests that the problem lies with the data preparation side (either the data themselves, or what we did with the recipe).
Ideally, you should have already had a look at your data (with
summary
or glimpse
). So, in this case, we know
that the data are fine. Whilst prepping (and sometimes baking) the
recipe is generally informative for predictor variables, it is hard to
diagnose problems with the outcome variable in a recipe. Prepping will
not show anything obvious:
lacerta_prep_sel <- lacerta_rec_sel %>% prep(lacerta_thin)
lacerta_prep_sel
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 20
#> coords: 2
#>
#> ── Training information
#> Training data contained 452 data points and no incomplete rows.
#>
#> ── Operations
#> • Variables selected: bio05, bio06, bio13, bio14, bio15 | Trained
In this case, it is a process of exclusion. Everything seems fine,
but the models don’t work. Then ask yourself if the outcome variable
might be problematic. As a general rule, we have found it easier to rely
on step_rm
to remove variables (e.g. correlated
variables).
Using the desired formula with GAM
General Additive Models have an unusual syntax, as the user has to
define which variables are fitted with splines. tidysdm
has
some functions to simplify this process, assuming that the user just
wants to fit a standard smooth to every continuous predictor.
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
package = "tidysdm"
))
lacerta_rec <- recipe(lacerta_thin, formula = class ~ .) %>%
step_rm(all_of(c(
"bio01", "bio02", "bio03", "bio04", "bio07", "bio08",
"bio09", "bio10", "bio11", "bio12", "bio14", "bio16",
"bio17", "bio18", "bio19", "altitude"
)))
lacerta_models <-
# create the workflow_set
workflow_set(
preproc = list(default = lacerta_rec),
models = list(
# the standard glm specs
glm = sdm_spec_glm(),
# the standard gam specs
gam = sdm_spec_gam()
),
# make all combinations of preproc and models,
cross = TRUE
) %>%
# set formula for gams
update_workflow_model("default_gam",
spec = sdm_spec_gam(),
formula = gam_formula(lacerta_rec)
) %>%
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
set.seed(100)
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 5)
lacerta_models <-
lacerta_models %>%
workflow_map("tune_grid",
resamples = lacerta_cv, grid = 3,
metrics = sdm_metric_set(), verbose = TRUE
)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: default_glm
#> ✔ 1 of 2 resampling: default_glm (273ms)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 2 of 2 resampling: default_gam
#> ✔ 2 of 2 resampling: default_gam (1.4s)
Note that the step of defining a formula is incompatible with using
step_cor
in a recipe. step_cor
removes
correlated variables in recipes, using a similar algorithm to
filter_collinear
using method cor_caret
.
However, the algorithm is fitted to each data split when
cross-validating. This means that different variables will eventually be
presented to the model when it is fitted for each split, leading to an
error as there will be a mismatch between the formula and the available
variables. This is a known issue of how GAMs are implemented in
tidymodels
.
When only some splits fail
In the examples above, all the splits used for cross-validation of a
given algorithms failed. However, it is also possible that failures
occur only on some splits for certain algorithms (technically, a
specific rsplit
within certain workflows
).
When this type of problem occurs, it is best to extract the problematic
workflow, and potentially investigate fitting it to the specific
rsplit
.
We generate a problematic dataset by subsampling the lacerta dataset:
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
package = "tidysdm"
))
set.seed(123)
lacerta_thin <- lacerta_thin[sample(
1:nrow(lacerta_thin),
nrow(lacerta_thin) / 5
), ]
lacerta_rec <- recipe(lacerta_thin, formula = class ~ .) %>%
step_rm(all_of(c(
"bio01", "bio02", "bio03", "bio04", "bio07", "bio08",
"bio09", "bio10", "bio11", "bio12", "bio14", "bio16",
"bio17", "bio18", "bio19", "altitude"
)))
lacerta_models <-
# create the workflow_set
workflow_set(
preproc = list(default = lacerta_rec),
models = list(
# the standard glm specs
glm = sdm_spec_glm(),
# the standard gam specs
gam = sdm_spec_gam(),
# rf specs with tuning
rf = sdm_spec_rf()
),
# make all combinations of preproc and models,
cross = TRUE
) %>%
# set formula for gams
update_workflow_model("default_gam",
spec = sdm_spec_gam(),
formula = gam_formula(lacerta_rec)
) %>%
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
We then create 3 folds and attempt to fit the models:
set.seed(100)
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 3)
lacerta_models <-
lacerta_models %>%
workflow_map("tune_grid",
resamples = lacerta_cv, grid = 3,
metrics = sdm_metric_set(), verbose = TRUE
)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 3 resampling: default_glm
#> ✔ 1 of 3 resampling: default_glm (183ms)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 2 of 3 resampling: default_gam
#> → A | warning: Fitting terminated with step failure - check results carefully
#> There were issues with some computations A: x1
#> There were issues with some computations A: x1
#>
#> ✔ 2 of 3 resampling: default_gam (1.6s)
#> i 3 of 3 tuning: default_rf
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 3 of 3 tuning: default_rf (458ms)
We see that one of the folds gives us an error when using GAMs. The
error (“Fitting terminated with step failure - check results carefully”)
comes from the gam function in the package mgcv
. A quick
google on StackOverflow[https://stats.stackexchange.com/questions/576273/gam-model-warning-message-step-failure-in-theta-estimation]
gives us an idea of where this error comes from.
We start by extracting the results of the gam fits:
gam_results <- extract_workflow_set_result(lacerta_models, id = "default_gam")
gam_results
#> # Resampling results
#> # 3-fold spatial block cross-validation
#> # A tibble: 3 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [54/36]> Fold1 <tibble [3 × 4]> <tibble [0 × 3]> <tibble [36 × 5]>
#> 2 <split [63/27]> Fold2 <tibble [3 × 4]> <tibble [1 × 3]> <tibble [27 × 5]>
#> 3 <split [64/26]> Fold3 <tibble [3 × 4]> <tibble [0 × 3]> <tibble [26 × 5]>
#>
#> There were issues with some computations:
#>
#> - Warning(s) x1: Fitting terminated with step failure - check results carefully
#>
#> Run `show_notes(.Last.tune.result)` for more information.
We see that, in the .notes
column, the second item is
not empty (it does not have zero rows). We can check that it indeed
contains the error that we wanted:
gam_results$.notes[2]
#> [[1]]
#> # A tibble: 1 × 3
#> location type note
#> <chr> <chr> <chr>
#> 1 preprocessor 1/1, model 1/1 warning Fitting terminated with step failure - ch…
We can now get the problematic data split, and extract the training data:
problem_split <- gam_results$splits[2][[1]]
summary(training(problem_split))
#> class geometry bio01 bio02
#> presence :18 POINT :63 Min. : 4.74 Min. : 6.737
#> pseudoabs:45 epsg:4326 : 0 1st Qu.:11.81 1st Qu.: 9.336
#> +proj=long...: 0 Median :13.09 Median :10.937
#> Mean :12.88 Mean :11.052
#> 3rd Qu.:14.82 3rd Qu.:12.649
#> Max. :17.87 Max. :14.037
#> bio03 bio04 bio05 bio06
#> Min. :34.30 Min. :341.2 Min. :19.90 Min. :-6.2732
#> 1st Qu.:39.30 1st Qu.:500.8 1st Qu.:24.91 1st Qu.:-0.6787
#> Median :40.55 Median :610.8 Median :28.59 Median : 1.1918
#> Mean :40.54 Mean :584.6 Mean :28.57 Mean : 1.2175
#> 3rd Qu.:42.19 3rd Qu.:656.1 3rd Qu.:32.31 3rd Qu.: 3.5664
#> Max. :46.98 Max. :756.7 Max. :35.31 Max. : 8.2344
#> bio07 bio08 bio09 bio10
#> Min. :16.40 Min. : 1.922 Min. : 1.588 Min. :12.86
#> 1st Qu.:23.32 1st Qu.: 7.716 1st Qu.:16.995 1st Qu.:18.53
#> Median :27.88 Median : 9.668 Median :19.828 Median :20.51
#> Mean :27.35 Mean : 9.450 Mean :18.938 Mean :20.48
#> 3rd Qu.:31.49 3rd Qu.:11.341 3rd Qu.:22.607 3rd Qu.:23.08
#> Max. :35.27 Max. :16.882 Max. :25.470 Max. :25.71
#> bio11 bio12 bio13 bio14
#> Min. :-2.060 Min. : 249.0 Min. : 36.0 Min. : 2.00
#> 1st Qu.: 4.968 1st Qu.: 452.0 1st Qu.: 59.0 1st Qu.: 8.00
#> Median : 6.236 Median : 628.0 Median : 91.0 Median :17.00
#> Mean : 6.268 Mean : 757.8 Mean :101.5 Mean :21.97
#> 3rd Qu.: 8.455 3rd Qu.:1016.5 3rd Qu.:119.0 3rd Qu.:30.50
#> Max. :11.795 Max. :1622.0 Max. :248.0 Max. :74.00
#> bio15 bio16 bio17 bio18
#> Min. :13.44 Min. : 96.0 Min. : 17.00 Min. : 22.0
#> 1st Qu.:30.07 1st Qu.:157.0 1st Qu.: 43.00 1st Qu.: 47.0
#> Median :38.97 Median :249.0 Median : 71.00 Median : 78.0
#> Mean :41.58 Mean :280.3 Mean : 88.08 Mean : 96.0
#> 3rd Qu.:54.30 3rd Qu.:334.0 3rd Qu.:109.50 3rd Qu.:117.5
#> Max. :71.59 Max. :714.0 Max. :253.00 Max. :253.0
#> bio19 altitude
#> Min. : 68.0 Min. : 38.0
#> 1st Qu.:128.5 1st Qu.: 319.5
#> Median :225.0 Median : 689.0
#> Mean :252.5 Mean : 685.5
#> 3rd Qu.:319.5 3rd Qu.: 855.0
#> Max. :714.0 Max. :1926.0
In this case, there is nothing too obvious that leads to the error (an important check is to make sure that you have enough presences in a split; too few presences will generally lead to errors).
We can now extract the workflow and refit it to the split to confirm that we have isolated the problem:
gam_workflow <- extract_workflow(lacerta_models, id = "default_gam")
faulty_gam <- fit(gam_workflow, training(problem_split))
#> Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L = G$L,
#> : Fitting terminated with step failure - check results carefully
The next step would be to dig deeper into the data, trying to understand whether there are some outliers that are problematic. The specific steps will depend on the algorithm that is giving problems.