Title: | Tidy Modelling for Nested Data |
---|---|
Description: | A modelling framework for nested data using the 'tidymodels' ecosystem. Specify how to nest data using the 'recipes' package, create testing and training splits using 'rsample', and fit models to this data using the 'parsnip' and 'workflows' packages. Allows any model to be fit to nested data. |
Authors: | Ashby Thorpe [aut, cre], Hadley Wickham [ctb] |
Maintainer: | Ashby Thorpe <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0.9000 |
Built: | 2025-01-22 04:29:00 UTC |
Source: | https://github.com/ashbythorpe/nestedmodels |
generics::augment() method for nested models. augment.nested_model_fit() will add column(s) for predictions to the given data.
## S3 method for class 'nested_model_fit'
augment(x, new_data, ...)
x | A |
new_data | A data frame - can be nested or non-nested. |
... | Passed onto |
A data frame with one or more added columns for predictions.
library(nestedmodels)
library(dplyr)
library(tidyr)
library(parsnip)
# Keep a subset of nests, then nest by every combination of id and id2
data <- filter(example_nested_data, id %in% 1:5)
nested_data <- nest(data, data = -c(id, id2))
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
fitted <- fit(model, z ~ x + y + a + b, nested_data)
augment(fitted, example_nested_data)
This method calls parsnip::autoplot.model_fit() on each model fitted on each nested data frame, returning a list of plots.
## S3 method for class 'nested_model_fit'
autoplot(object, ...)
object | A |
... | Passed into |
Printing the list of plots will print every plot in turn, so remember to store the result of this function in a variable to look at each plot individually.
A list of ggplot2::ggplot() objects.
library(nestedmodels)
library(dplyr)
library(tidyr)
library(purrr)
library(parsnip)
library(glmnet)
library(ggplot2)
data <- filter(example_nested_data, id %in% 16:20)
nested_data <- nest(data, data = -id2)
model <- linear_reg(penalty = 1) %>%
  set_engine("glmnet") %>%
  nested()
fit <- fit(model, z ~ x + y + a + b, nested_data)
plots <- autoplot(fit)
# View the first plot
plots[[1]]
# Use the patchwork package (or others) to combine the plots
library(patchwork)
reduce(plots, `+`)
A dataset containing example data that can be nested. Mainly used for examples and testing.
example_nested_data
A tibble with 1000 rows and 7 variables
A column that can be nested, ranging from 1 to 20.
Another column that can be nested, ranging from 1 to 10.
A numeric column that depends on 'id'.
A sequential numeric column (with some added randomness), independent of the other columns.
A column dependent on id, id2, x and y.
A randomly generated numeric column, ranging from 1 to 100.
A randomly generated numeric column, centred around 50.
example_nested_data
Extract the inner model of a nested_model object, or a workflow containing a nested model.
extract_inner_model(x, ...)
## Default S3 method:
extract_inner_model(x, ...)
## S3 method for class 'nested_model'
extract_inner_model(x, ...)
## S3 method for class 'workflow'
extract_inner_model(x, ...)
## S3 method for class 'model_spec'
extract_inner_model(x, ...)
x | A model spec or workflow. |
... | Not used. |
A model_spec object.
library(nestedmodels)
library(parsnip)
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
extract_inner_model(model)
generics::fit_xy() method for nested models. This should not be called directly and instead should be called by workflows::fit.workflow().
## S3 method for class 'nested_model'
fit_xy(
  object,
  x,
  y,
  case_weights = NULL,
  control = parsnip::control_parsnip(),
  ...
)
object | An |
x | A data frame of predictors. |
y | A data frame of outcome data. |
case_weights | An optional vector of case weights. Passed into |
control | A |
... | Passed into |
A nested_model_fit object with several elements:
spec: The model specification object (the inner model of the nested model object).
fit: A tibble containing the model fits and the nests that they correspond to.
inner_names: A character vector of names, used to help with nesting the data during predictions.
parsnip::fit.model_spec()
parsnip::model_fit
library(nestedmodels)
library(dplyr)
library(parsnip)
library(recipes)
library(workflows)
data <- filter(example_nested_data, id %in% 11:20)
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
# step_nest() tells the workflow how to nest the data
recipe <- recipe(data, z ~ x + y + id) %>%
  step_nest(id)
wf <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model)
fit(wf, data)
fit.model_spec() takes a nested model specification and fits the inner model specification to each nested data frame in the given dataset.
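The idea of fitting one inner model per nest can be sketched in base R. This is only an illustration of the concept, not the package's implementation: the toy data frame and the choice of stats::lm() as the inner model are assumptions made for the example.

```r
# Toy data: three nests (id = 1, 2, 3), each with its own slope of z on x.
set.seed(42)
toy <- data.frame(id = rep(1:3, each = 20), x = rnorm(60))
toy$z <- toy$id * toy$x + rnorm(60, sd = 0.1)

# Split the rows by nest and fit the same model specification to each piece.
fits <- lapply(split(toy, toy$id), function(nest) lm(z ~ x, data = nest))

# One fitted model per nest; each recovers a slope close to its nest's id.
slopes <- sapply(fits, function(m) coef(m)[["x"]])
slopes
```

Each element of `fits` plays the role of one row in the tibble of model fits described in the return value.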
## S3 method for class 'nested_model'
fit(
  object,
  formula,
  data,
  case_weights = NULL,
  control = parsnip::control_parsnip(),
  ...
)
object | An |
formula | An object of class |
data | A data frame. If used with a 'nested_model' object, the data frame must already be nested. |
case_weights | An optional vector of case weights. Passed into |
control | A |
... | Passed into |
A nested_model_fit object with several elements:
spec: The model specification object (the inner model of the nested model object).
fit: A tibble containing the model fits and the nests that they correspond to.
inner_names: A character vector of names, used to help with nesting the data during predictions.
parsnip::fit.model_spec()
parsnip::model_fit
library(nestedmodels)
library(parsnip)
library(tidyr)
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
nested_data <- nest(example_nested_data, data = -id)
fit(model, z ~ x + y + a + b, nested_data)
parsnip::multi_predict() method for nested models. Allows predictions to be made on sub-models in a model object.
## S3 method for class 'nested_model_fit'
multi_predict(object, new_data, ...)
object | A |
new_data | A data frame - can be nested or non-nested. |
... | Passed onto |
A tibble with the same number of rows as new_data, after it has been unnested.
library(nestedmodels)
library(dplyr)
library(tidyr)
library(parsnip)
library(glmnet)
data <- filter(example_nested_data, id %in% 16:20)
nested_data <- nest(data, data = -id2)
model <- linear_reg(penalty = 1) %>%
  set_engine("glmnet") %>%
  nested()
fitted <- fit(model, z ~ x + y + a + b, nested_data)
# Predict at several penalty values for every inner model
multi_predict(fitted, example_nested_data,
  penalty = c(0.1, 0.2, 0.3)
)
nested() turns a model or workflow into a nested model/workflow. is_nested() checks if a model or workflow is nested.
nested(x, ...)
is_nested(x, ...)
## Default S3 method:
nested(x, ...)
## S3 method for class 'model_spec'
nested(x, allow_par = FALSE, pkgs = NULL, ...)
## S3 method for class 'nested_model'
nested(x, allow_par = FALSE, pkgs = NULL, ...)
## S3 method for class 'workflow'
nested(x, allow_par = FALSE, pkgs = NULL, ...)
## Default S3 method:
is_nested(x, ...)
## S3 method for class 'model_spec'
is_nested(x, ...)
## S3 method for class 'workflow'
is_nested(x, ...)
x | A model specification or workflow. |
... | Not currently used. |
allow_par | A logical to allow parallel processing over nests during the fitting process (if a parallel backend is registered). |
pkgs | An optional character string of R package names that should be loaded (by namespace) during parallel processing. |
A nested model object, or a workflow containing a nested model.
For is_nested(), a logical vector of length 1.
library(nestedmodels)
library(parsnip)
library(workflows)
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
model
is_nested(model)
wf <- workflow() %>%
  add_model(model)
is_nested(wf)
Use any rsample split function on nested data, where nests act as strata. This almost guarantees that every split will contain data from every nested data frame.
nested_resamples(
  data,
  resamples,
  nesting_method = NULL,
  size_action = c("truncate", "recycle", "recycle-random", "combine",
    "combine-random", "combine-end", "error"),
  ...
)
data | A data frame. |
resamples | An expression, function, formula or string that can be evaluated to produce an |
nesting_method | A recipe, workflow or |
size_action | If different numbers of splits are produced in each nest, how should sizes be matched? (see Details) |
... | Extra arguments to pass into |
This function breaks down a data frame into smaller, nested data frames. Resampling is then performed within these nests, and the results are combined together at the end. This ensures that each split contains data from every nest. However, this function does not perform any pooling (unlike rsample::make_strata()), so you may run into issues if a nest is too small.
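The "resample within each nest, then combine" idea can be sketched in base R for a single training/testing split. The helper name split_within_nests() and its prop argument are hypothetical and exist only for this illustration; the package's own code may work differently.

```r
# Hypothetical helper: draw training rows inside every nest, then combine them.
split_within_nests <- function(data, nest_col, prop = 0.75) {
  # Row indices grouped by the value of the nesting column.
  rows_by_nest <- split(seq_len(nrow(data)), data[[nest_col]])
  # Sample a training subset inside each nest independently.
  train_rows <- lapply(rows_by_nest, function(rows) {
    n_train <- max(1L, floor(length(rows) * prop))
    rows[sample.int(length(rows), n_train)]
  })
  train <- sort(unlist(train_rows, use.names = FALSE))
  list(train = train, test = setdiff(seq_len(nrow(data)), train))
}

set.seed(1)
toy <- data.frame(id = rep(1:4, each = 10), x = rnorm(40))
parts <- split_within_nests(toy, "id")
# Every nest contributes rows to the training set.
table(toy$id[parts$train])
```

Because nothing is pooled, a nest with a single row would place that row in the training set and leave the testing set without any rows from that nest, which is the kind of issue the paragraph above warns about.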
Either an rsplit object or an rset object, depending on resamples.
data can be nested in several ways:
If nesting_method is NULL and data is grouped (using dplyr::group_by()), the data will be nested (see tidyr::nest() for how this works).
If data is not grouped, it is assumed to already be nested, and nested_resamples will try to find a column that contains nested data frames.
If nesting_method is a workflow or recipe, and the recipe has a step created using step_nest(), data will be nested using the step in the recipe. This is convenient if you've already created a recipe or workflow, as it saves a line of code.
The resamples argument can take many forms:
A function call, such as vfold_cv(v = 5). This is similar to the format of rsample::nested_cv().
A function, such as rsample::vfold_cv.
A purrr-style anonymous function, which will be converted to a function using rlang::as_function().
A string, which will be evaluated using rlang::exec().
Every method will be evaluated with data passed in as the first argument (with name 'data').
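As a rough base R approximation of how several of these input forms could be normalised into one callable, consider the sketch below. The helper as_resampler() and the stand-in function first_rows() are hypothetical; the package itself relies on rlang (as noted above), and the unevaluated function-call form is not covered here.

```r
# Hypothetical helper: turn a function, a string or a one-sided formula
# into a function that takes the data as its first argument.
as_resampler <- function(resamples) {
  if (is.function(resamples)) return(resamples)
  if (is.character(resamples)) return(match.fun(resamples))
  if (inherits(resamples, "formula")) {
    body <- resamples[[2]]           # right-hand side of a one-sided formula
    env <- environment(resamples)
    # Evaluate the formula body with `.` bound to the data.
    return(function(.) eval(body, list(. = .), env))
  }
  stop("Unsupported `resamples` input.")
}

# Stand-in for a resampling function (assumption for this example only).
first_rows <- function(data, n = 2) head(data, n)
toy <- data.frame(x = 1:5)

as_resampler(first_rows)(toy)     # a function
as_resampler("first_rows")(toy)   # a string naming a function
as_resampler(~ head(., 3))(toy)   # a purrr-style formula
```

In each case the data frame ends up as the first argument of the resolved function, mirroring the rule stated above.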
Before the set of resamples created in each nest can be combined, they must contain the same number of splits. For most resampling methods, this will not be an issue. rsample::vfold_cv(), for example, reliably creates the number of splits defined in its 'v' argument. However, other resampling methods, like rsample::rolling_origin(), depend on the size of their 'data' argument, and therefore may produce different numbers of resamples when presented with differently sized nests.
The size_action argument defines many ways of matching the sizes of resample sets with different numbers of splits. These methods will either try to reduce the number of splits in each set until each rset is the same length as the set with the lowest number of splits; or the opposite, where each rset will have the same number of splits as the largest set.
"truncate", the default, means that all splits beyond the required length will be removed.
"recycle" means that sets of splits will be extended by repeating elements until the required length has been reached, mimicking the process of vector recycling. The advantage of this method is that all created splits will be preserved.
"recycle-random" is a similar process to recycling, but splits will be copied at random to spaces in the output, which may be important if the order of resamples matters. This process is not completely random, and the program makes sure that every split is copied roughly the same number of times.
"combine" gets rid of excess splits by combining them with previous ones. This means the training and testing rows are merged into one split. Combining is done systematically: if a set of splits needs to be compacted down to a set of 5, the first split is combined with the sixth split, then the eleventh, then the sixteenth, etc. This approach is not recommended, since it is not clear what the benefit of a combined split is.
"combine-random" combines each split with a random set of other splits, instead of the systematic process described in the previous method. Once again, this process is not actually random, and each split will be combined with roughly the same number of other splits.
"combine-end" combines every excess split with the last non-excess split.
"error" throws an error if each nest does not produce the same number of splits.
rsample::initial_split() for an example of the strata argument.
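The "truncate", "recycle" and "error" options of size_action described above can be illustrated with plain vectors standing in for sets of splits. The helper match_sizes() is hypothetical and only sketches the size-matching idea; it does not reproduce the package's code or the combining options.

```r
# Hypothetical helper: make every set of splits the same length.
match_sizes <- function(split_sets, action = c("truncate", "recycle", "error")) {
  action <- match.arg(action)
  sizes <- lengths(split_sets)
  if (length(unique(sizes)) == 1L) return(split_sets)
  switch(action,
    # Drop all splits beyond the length of the shortest set.
    truncate = lapply(split_sets, function(s) s[seq_len(min(sizes))]),
    # Repeat splits until every set is as long as the longest set.
    recycle = lapply(split_sets, function(s) rep_len(s, max(sizes))),
    error = stop("Nests produced different numbers of splits.")
  )
}

# Labels stand in for the split objects produced inside two nests.
sets <- list(nest_1 = c("s1", "s2", "s3"), nest_2 = c("s1", "s2"))
match_sizes(sets, "truncate")
match_sizes(sets, "recycle")
```

Truncating discards the third split of the first nest, whereas recycling keeps every split and reuses the first split of the second nest.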
library(nestedmodels)
library(tidyr)
library(recipes)
library(workflows)
library(rsample)
library(dplyr)
nested_data <- example_nested_data %>%
  nest(data = -id)
grouped_data <- example_nested_data %>%
  group_by(id)
recipe <- recipe(example_nested_data, z ~ .) %>%
  step_nest(id)
wf <- workflow() %>%
  add_recipe(recipe)
nested_resamples(nested_data, vfold_cv())
nested_resamples(
  group_by(example_nested_data, id),
  ~ initial_split(.)
)
nested_resamples(
  example_nested_data,
  initial_validation_split,
  nesting_method = recipe
)
nested_resamples(example_nested_data, bootstraps,
  times = 25, nesting_method = wf
)
# nested nested resamples
nested_resamples(nested_data, nested_cv(
  vfold_cv(),
  bootstraps()
))
Apply a fitted nested model to generate different types of predictions.
stats::predict() / parsnip::predict_raw() methods for nested model fits.
## S3 method for class 'nested_model_fit'
predict(object, new_data, type = NULL, opts = list(), ...)
## S3 method for class 'nested_model_fit'
predict_raw(object, new_data, opts = list(), ...)
object | A |
new_data | A data frame to make predictions on. Can be nested or non-nested. |
type | A singular character vector or |
opts | A list of optional arguments. Passed on to |
... | Arguments for the underlying model's predict function. Passed on to |
A data frame of model predictions. For predict_raw(), a matrix, data frame, vector or list.
library(nestedmodels)
library(dplyr)
library(tidyr)
library(parsnip)
data <- filter(example_nested_data, id %in% 5:15)
nested_data <- nest(data, data = -id)
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
fitted <- fit(model, z ~ x + y + a + b, nested_data)
predict(fitted, example_nested_data)
predict_raw(fitted, example_nested_data)
step_nest() creates a specification of a recipe step that will convert specified data into a single model term, specifying the 'nest' that each row of the dataset corresponds to.
step_nest(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  names = NULL,
  lookup_table = NULL,
  skip = FALSE,
  id = recipes::rand_id("nest")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables. For |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
names | The names of the variables selected by |
lookup_table | The table describing which values of your selected columns correspond to which unique nest id are stored here once this preprocessing step has been trained by |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
step_nest() will create a single nominal variable (named '.nest_id') from a set of variables (of any type). Every unique combination of the specified columns will receive a single nest id.
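One way to picture the mapping from column combinations to nest ids is the base R sketch below. The helper assign_nest_id() is hypothetical, and the integer labels it produces are an assumption; the labels used by the package may differ.

```r
# Hypothetical helper: one id per unique combination of the chosen columns.
assign_nest_id <- function(data, cols) {
  # Paste the nesting columns into a single key per row.
  key <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  # Number the keys by order of first appearance and store them as a factor.
  factor(match(key, unique(key)))
}

toy <- data.frame(
  id = c(1, 1, 2, 2, 1),
  id2 = c("a", "b", "a", "a", "a")
)
toy$.nest_id <- assign_nest_id(toy, c("id", "id2"))
toy
```

Rows 1 and 5 share the combination (1, "a") and therefore the same nest id, while rows 3 and 4 share another.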
This recipe step is designed for use with nested models, since a model will be fitted on the data corresponding to each nest id. Using a recipe is often easier and more reliable than nesting the data manually.
The nest id corresponding to each unique combination of column values is decided when the recipe is prepped (if this recipe is contained in a workflow, this happens when the workflow is fitted). This means that when using a prepped recipe on new data (using recipes::bake() or workflows::predict.workflow()), all unique combinations of nesting columns must also exist in the training data. You will be warned if this is not the case. If you are using the 'rsample' package to create splits and this presents an issue, you may want to consider using nested_resamples().
step_nest() is designed so that nesting the transformed data by its '.nest_id' column is equivalent to the following action on the non-transformed data:
data %>%
  dplyr::group_by(...) %>% # '...' represents your specified terms
  tidyr::nest()
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned showing which nest id each unique combination of values of the terms you have specified corresponds to.
The underlying operation does not allow for case weights.
library(nestedmodels)
library(recipes)
# Nest by a single column
recipe <- recipe(example_nested_data, z ~ x + id) %>%
  step_nest(id)
recipe %>%
  prep() %>%
  bake(NULL)
# Nest by every column in the recipe except x and z
recipe2 <- recipe(example_nested_data, z ~ x + id) %>%
  step_nest(-c(x, z))
recipe2 %>%
  prep() %>%
  bake(NULL)
Use broom functions on fitted nested models.
tidy.nested_model_fit() summarises components of each model within a nested model fit, indicating which nested data frame each row corresponds to.
glance.nested_model_fit() summarises a nested model, returning a tibble::tibble() with 1 row.
glance_nested() summarises each model within a nested model fit, returning a tibble::tibble() with the same number of rows as the number of inner models.
## S3 method for class 'nested_model_fit'
tidy(x, ...)
## S3 method for class 'nested_model_fit'
glance(x, ...)
glance_nested(x, ...)
x | A |
... | Additional arguments passed into their respective functions. (e.g. for |
generics::glance() states that glance() methods should always return 1 row outputs for non-empty inputs. The 'nestedmodels' package is no exception: glance() methods will combine rows to produce a result with a single row. Specifically:
If a column contains 1 unique value, that value is used.
If a column is numeric, the mean will be calculated.
Otherwise, the results will be combined into a list.
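These three rules can be written down as a small base R sketch. The helper combine_column() and the column names in the toy data are hypothetical; how the package treats missing values or other edge cases is not covered here.

```r
# Hypothetical helper: collapse one column of per-model summaries to one value.
combine_column <- function(col) {
  if (length(unique(col)) == 1L) {
    col[1]        # a single unique value is used as is
  } else if (is.numeric(col)) {
    mean(col)     # numeric columns are averaged
  } else {
    list(col)     # anything else is kept whole inside a list
  }
}

# Toy per-model summaries, one row per inner model (invented column names).
per_model <- data.frame(
  r.squared = c(0.8, 0.6),
  nobs = c(50L, 50L),
  note = c("ok", "warning"),
  stringsAsFactors = FALSE
)
one_row <- lapply(per_model, combine_column)
str(one_row)
```

Here `nobs` keeps its shared value, `r.squared` is averaged, and `note` becomes a list holding both strings.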
A tibble::tibble(). With glance.nested_model_fit(), the tibble will have 1 row.
generics::tidy()
generics::glance()
library(nestedmodels)
library(dplyr)
library(parsnip)
library(broom)
data <- filter(example_nested_data, id %in% 1:5)
model <- linear_reg() %>%
  set_engine("lm") %>%
  nested()
fit <- fit(
  model, z ~ x + y + a + b,
  group_by(data, id)
)
tidy(fit)
glance(fit)
glance_nested(fit)