
Deprecated: Filter to retain only variables below a given correlation threshold
Source: R/filter_high_cor.R
filter_high_cor.Rd
THIS FUNCTION IS DEPRECATED. USE filter_collinear with method=cor_caret instead.
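A minimal migration sketch, assuming tidysdm is installed and that filter_collinear accepts the same x and cutoff arguments together with method = "cor_caret" as the notice above suggests. The toy data frame and its column names are made up for illustration.

```r
# Toy data (hypothetical): bio2 is nearly a copy of bio1, bio3 is independent.
set.seed(1)
bio1 <- rnorm(50)
climate_df <- data.frame(
  bio1 = bio1,
  bio2 = bio1 + rnorm(50, sd = 0.1),
  bio3 = rnorm(50)
)

# Only run the package call if tidysdm is available.
if (requireNamespace("tidysdm", quietly = TRUE)) {
  # Replacement for filter_high_cor(climate_df, cutoff = 0.7);
  # the argument names here are assumed to match the deprecated function.
  vars <- tidysdm::filter_collinear(climate_df, cutoff = 0.7,
                                    method = "cor_caret")
  print(vars)
}
```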
Usage
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)
# Default S3 method
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)
# S3 method for class 'SpatRaster'
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)
# S3 method for class 'data.frame'
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)
# S3 method for class 'matrix'
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)
Arguments
- x
A terra::SpatRaster object, a data.frame (with only numeric variables), or a correlation matrix
- cutoff
A numeric value for the pair-wise absolute correlation cutoff
- verbose
A boolean for printing the details
- names
A logical; should the column names be returned (TRUE) or the column index (FALSE)?
- to_keep
A vector of variable names that we want to force in the set (note that the function will return an error if the correlation among any of those variables is higher than the cutoff).
Value
A vector of names of columns that are below the correlation
threshold (when names = TRUE), otherwise a vector of indices. Note
that the indices are only for numeric variables (i.e. if factors are
present, the indices do not take them into account).
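The note on indices can be illustrated with base R alone. This sketch assumes, as stated above, that an index returned with names = FALSE counts positions among the numeric columns only. The data frame and the index value are hypothetical.

```r
# Hypothetical data frame with a factor in the first column.
df <- data.frame(
  site = factor(c("x", "y", "z")),
  temp = c(1, 2, 3),
  rain = c(3, 1, 2)
)

# Names of the numeric columns, in their original order.
num_names <- names(df)[vapply(df, is.numeric, logical(1))]

# Suppose an index of 2 was returned (illustrative value, not real output).
idx <- 2

# Map the index back through the numeric-only names, not names(df).
kept_name <- num_names[idx]        # "rain"
wrong_name <- names(df)[idx]       # "temp", because the factor shifts positions
```

Indexing the full data frame directly would pick the wrong column whenever a non-numeric column precedes the retained variable.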
Details
This method finds a subset of variables such that all pairwise correlations
are below a given cutoff. There are methods for terra::SpatRaster,
data.frame, and to work directly on a correlation matrix that was
previously estimated. For data.frame, only numeric variables will be
considered. The algorithm is based on caret::findCorrelation, using the
exact option. The absolute values of pair-wise correlations are considered.
If two variables have a high correlation, the function looks at the mean
absolute correlation of each variable and removes the variable with the
largest mean absolute correlation.
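The removal rule described above can be sketched in base R. This is a simplified greedy variant written for illustration, not a reimplementation of caret::findCorrelation: the helper drop_high_cor is hypothetical, and processing the most correlated pair first is an assumption of this sketch.

```r
# Hypothetical helper: repeatedly take the most correlated pair above the
# cutoff and drop the member with the larger mean absolute correlation.
drop_high_cor <- function(cor_mat, cutoff = 0.7) {
  a <- abs(cor_mat)
  diag(a) <- NA                      # ignore self-correlations
  keep <- colnames(a)
  repeat {
    sub <- a[keep, keep, drop = FALSE]
    if (length(keep) < 2 || max(sub, na.rm = TRUE) <= cutoff) break
    # Locate one cell holding the largest remaining absolute correlation.
    idx <- which(sub == max(sub, na.rm = TRUE), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    # Mean absolute correlation of each pair member with the remaining variables.
    mean_abs <- rowMeans(sub[pair, , drop = FALSE], na.rm = TRUE)
    keep <- setdiff(keep, pair[which.max(mean_abs)])
  }
  keep
}

# Made-up correlation matrix: a and b are highly correlated.
cm <- matrix(c(1.0, 0.9, 0.2,
               0.9, 1.0, 0.5,
               0.2, 0.5, 1.0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

kept <- drop_high_cor(cm, cutoff = 0.7)
print(kept)  # "a" "c"
```

Here b is dropped rather than a because its mean absolute correlation with the other variables (0.7) exceeds that of a (0.55).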
Several functions in the package subselect can also be used to accomplish
the same goal, but they tend to retain more predictors.