README

{recforest} offers a flexible solution for analyzing recurrent events in survival data, outperforming traditional methods like the Cox model, which struggle with repeated events (e.g., hospital readmissions) and terminal events like death. By leveraging machine learning (Random Survival Forests), RecForest models both the timing and frequency of events, even with right-censored data, leading to more accurate predictions and insights, ultimately aiding in better decision-making and patient care.

Overview

Installation

# Install from CRAN:
install.packages("recforest")

# Or the development version from GitHub:
# install.packages("pak")
pak::pak("XXXX/recforest")

Usage

A example dataset is provided with the package. It is a modified version of the bladder1 dataset, studying bladder cancer recurrences, from the {survival} package, adapted to be usable with {recforest}. Please use ?survival::bladder and ?bladder1_recforest to get more information about the dataset.

library(recforest)
data("bladder1_recforest")
head(bladder1_recforest)
#> # A tibble: 6 × 8
#>      id t.start t.stop treatment number  size death event
#>   <int>   <int>  <int> <fct>      <int> <int> <dbl> <dbl>
#> 1     1       0      0 placebo        1     1     0     0
#> 2     2       0      1 placebo        1     3     0     0
#> 3     3       0      4 placebo        2     1     0     0
#> 4     4       0      7 placebo        1     1     0     0
#> 5     5       0     10 placebo        5     1     0     0
#> 6     6       0      6 placebo        4     1     0     1

Train the model

trained_forest <- train_forest(
  data = bladder1_recforest,
  id_var = "id",
  covariates = c("treatment", "number", "size"),
  time_vars = c("t.start", "t.stop"),
  death_var = "death",
  event = "event",
  n_trees = 5,
  n_bootstrap = 70,
  mtry = 2,
  minsplit = 3,
  nodesize = 15,
  method = "NAa",
  min_score = 5,
  max_nodes = 20,
  seed = 111,
  parallel = FALSE,
  verbose = FALSE
)

A full explanation of the data-related and model-related parameters is provided in the vignette (see Further details).

Using parallel computing

The implementation of parallel computing in this package is based on the {future} and {future.apply} packages. To enable parallel processing, the parallel parameter must be set to TRUE. The number of cores to use can be specified by adjusting the workers parameter in the future::plan() function, which configures the parallelization strategy. Two commonly used strategies are:

In both cases, the number of cores (n_cores) should be defined by the user based on the available resources.

# Define the strategy and number of cores
n_cores <- min(future::availableCores(), n_trees)
future::plan(future::multisession, workers = n_cores - 1)

If you wish to use parallel computing, please refer to the {future} package documentation for more information.

Analyzing results

print(trained_forest)
#>
#> ── Tree 1 ──
#>
#> ℹ Number of nodes : 5
#> ℹ c_index : 0.7490882567469
#> ℹ mse_imse : 316.52338769398
#> ℹ mse_iscore : -23.8459208072833
#>
#> ── Tree 2 ──
#>
#> ℹ Number of nodes : 7
#> ℹ c_index : 0.749320446994866
#> ℹ mse_imse : 317.020277729943
#> ℹ mse_iscore : -24.3379911330769
#>
#> ── Tree 3 ──
#>
#> ℹ Number of nodes : 9
#> ℹ c_index : 0.725611597704621
#> ℹ mse_imse : 552.713717581106
#> ℹ mse_iscore : -260.02620872387
#>
#> ── Tree 4 ──
#>
#> ℹ Number of nodes : 5
#> ℹ c_index : 0.757197981596913
#> ℹ mse_imse : 391.055121586804
#> ℹ mse_iscore : -98.20144371709
#>
#> ── Tree 5 ──
#>
#> ℹ Number of nodes : 7
#> ℹ c_index : 0.75103734439834
#> ℹ mse_imse : 419.451806839106
#> ℹ mse_iscore : -126.772535366445

summary(trained_forest)
#>
#> ── Data summary ────────────────────────────────────────────────────────────────
#> ℹ Number of individuals : 118
#> ℹ Number of predictors : 3
#>
#> ── Model parameters ────────────────────────────────────────────────────────────
#> ℹ mtry : 2
#> ℹ minsplit : 3
#> ℹ nodesize : 15
#> ℹ method : NAa
#> ℹ min_score : 5
#> ℹ max_nodes : 20
#>
#> ── Metrics ─────────────────────────────────────────────────────────────────────
#> ℹ c_index : 0.746451125488328
#> ℹ mse_imse : 399.352862286188
#> ℹ mse_iscore : -106.636819949553
#> ℹ computation time (seconds) : 4.4

Prediction

The model can be used to predict an expected mean cumulative number of recurrent events per individual at the end of follow-up.

predictions <- predict(
  trained_forest,
  newdata = bladder1_recforest,
  id_var = "id",
  covariates = c("treatment", "number", "size"),
  time_vars = c("t.start", "t.stop"),
  death_var = "death"
)

Further details

A deeper explanation of the methodology and the features of the package can be found in the Vignettes.