I have been trying to incorporate more of the tidymodels suite of packages into my predictive modeling workflow (rsample, recipes), but I find myself frequently falling back on caret for model tuning and fitting because it’s what I know. This post is a walk-through, from start to finish, using the tidymodels suite. This post is NOT a tutorial for supervised learning. It often makes choices to better illustrate package functionality rather than to develop the best (or even a good) predictive model, and it assumes you know how and why to split your data, resample, up/down sample, tune parameters, evaluate models, etc. If you are looking for guides to machine learning, I highly recommend:
Learning to teach machines to learn: This post from Alison Hill is FULL of great resources.
📕 Applied Predictive Modeling by Max Kuhn and Kjell Johnson
📕 Intro to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
📕 The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman
——
Let’s get started:
From the tidymodels readme:

tidymodels is a “meta-package” for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse

The Data
For this post I am going to use the German Credit data from the University of California Irvine Machine Learning Repository. This data set is included in the caret package, but the categorical variables are dummy coded, and I want to demonstrate some of the factor options in recipes, so I am converting the dummy variables back to factors1 (a rough sketch of one way to do that is shown below).
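
As a rough sketch (not the code from the post referenced in the footnote), a block of 0/1 dummy columns sharing a prefix can be collapsed back into a single factor with a small helper like this. collapse_dummies is a hypothetical name, and it assumes one dummy column per level, as in the caret version of the data:

# Hypothetical helper: collapse a set of 0/1 dummy columns sharing a prefix
# (e.g. "CheckingAccountStatus.*") back into a single factor column.
# Assumes every level has its own dummy column, so each row has exactly one 1.
collapse_dummies <- function(data, prefix) {
  dummy_cols <- grep(paste0("^", prefix, "\\."), names(data), value = TRUE)
  lvls <- sub(paste0("^", prefix, "\\."), "", dummy_cols)
  # for each row, pick the level whose dummy column equals 1
  factor(lvls[max.col(as.matrix(data[dummy_cols]))], levels = lvls)
}
# e.g. GermanCredit$CheckingAccountStatus <- collapse_dummies(GermanCredit, "CheckingAccountStatus")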

library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
data(GermanCredit, package = "caret") 
glimpse(GermanCredit)
## Observations: 1,000
## Variables: 62
## $ Duration                               <int> 6, 48, 12, 42, 24, 36, 24, 36,…
## $ Amount                                 <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage              <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
## $ ResidenceDuration                      <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
## $ Age                                    <int> 67, 22, 49, 45, 53, 35, 53, 35…
## $ NumberExistingCredits                  <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ NumberPeopleMaintenance                <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ Telephone                              <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ ForeignWorker                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class                                  <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus.lt.0             <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.0.to.200         <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ CheckingAccountStatus.gt.200           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.none             <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
## $ CreditHistory.NoCredit.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.ThisBank.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.PaidDuly                 <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
## $ CreditHistory.Delay                    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory.Critical                 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ Purpose.NewCar                         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ Purpose.UsedCar                        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Purpose.Furniture.Equipment            <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Purpose.Radio.Television               <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Purpose.DomesticAppliance              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Repairs                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Education                      <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, …
## $ Purpose.Vacation                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Retraining                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Business                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Other                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.lt.100             <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
## $ SavingsAccountBonds.100.to.500         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.500.to.1000        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ SavingsAccountBonds.gt.1000            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ SavingsAccountBonds.Unknown            <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration.lt.1                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration.1.to.4              <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
## $ EmploymentDuration.4.to.7              <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ EmploymentDuration.gt.7                <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ EmploymentDuration.Unemployed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Male.Divorced.Seperated       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Female.NotSingle              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Personal.Male.Single                   <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ Personal.Male.Married.Widowed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Female.Single                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.None            <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors.CoApplicant     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.Guarantor       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Property.RealEstate                    <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ Property.Insurance                     <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Property.CarOther                      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ Property.Unknown                       <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Bank             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Stores           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing.Rent                           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Housing.Own                            <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ Housing.ForFree                        <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Job.UnemployedUnskilled                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job.UnskilledResident                  <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
## $ Job.SkilledEmployee                    <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
## $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …

Training Data and Testing Data using rsample
The first thing I want to do is split my data into a training set and a testing set. rsample has a function initial_split that allows us to specify the proportion to be used for training (the default is 0.75) and a strata argument for stratified sampling (this lets us ensure relative balance of our outcome between training and testing). A seed should always be set to ensure reproducibility. The training and testing functions then allow us to access the respective data.

set.seed(1450)
credit_split <- GermanCredit %>% 
  initial_split(prop = 0.75, strata = Class)
credit_split
## <750/250/1000>

get_prop <- function(data, variable){
  data %>% 
    count({{variable}}) %>% 
    mutate(pct = n/sum(n))
}
map_dfr(list(training = training(credit_split), 
             testing = testing(credit_split)), 
    get_prop, variable = Class, 
    .id = 'source')
## # A tibble: 4 x 4
##   source   Class     n   pct
##   <chr>    <fct> <int> <dbl>
## 1 training Bad     225   0.3
## 2 training Good    525   0.7
## 3 testing  Bad      75   0.3
## 4 testing  Good    175   0.7

If we don’t use stratified sampling on the outcome, we are likely to have some imbalance between training and testing, as seen below.

set.seed(1450)
credit_split_no_strata <- GermanCredit %>% 
  initial_split(prop = 0.75)
credit_split_no_strata
## <750/250/1000>

map_dfr(list(training = training(credit_split_no_strata), 
             testing = testing(credit_split_no_strata)), 
    get_prop, variable = Class, 
    .id = 'source')
## # A tibble: 4 x 4
##   source   Class     n   pct
##   <chr>    <fct> <int> <dbl>
## 1 training Bad     215 0.287
## 2 training Good    535 0.713
## 3 testing  Bad      85 0.34 
## 4 testing  Good    165 0.66

We will move forward with the training data from the stratified sampling.

training(credit_split) %>% glimpse
## Observations: 750
## Variables: 62
## $ Duration                               <int> 6, 48, 12, 42, 24, 24, 36, 12,…
## $ Amount                                 <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage              <int> 4, 2, 2, 2, 3, 3, 2, 2, 4, 3, …
## $ ResidenceDuration                      <int> 4, 2, 3, 4, 4, 4, 2, 4, 2, 1, …
## $ Age                                    <int> 67, 22, 49, 45, 53, 53, 35, 61…
## $ NumberExistingCredits                  <int> 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, …
## $ NumberPeopleMaintenance                <int> 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, …
## $ Telephone                              <dbl> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
## $ ForeignWorker                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class                                  <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus.lt.0             <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.0.to.200         <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ CheckingAccountStatus.gt.200           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.none             <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, …
## $ CreditHistory.NoCredit.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.ThisBank.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.PaidDuly                 <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, …
## $ CreditHistory.Delay                    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory.Critical                 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ Purpose.NewCar                         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, …
## $ Purpose.UsedCar                        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ Purpose.Furniture.Equipment            <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, …
## $ Purpose.Radio.Television               <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Purpose.DomesticAppliance              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Repairs                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Education                      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Vacation                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Retraining                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Business                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Other                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.lt.100             <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ SavingsAccountBonds.100.to.500         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.500.to.1000        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ SavingsAccountBonds.gt.1000            <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ SavingsAccountBonds.Unknown            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration.lt.1                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ EmploymentDuration.1.to.4              <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, …
## $ EmploymentDuration.4.to.7              <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
## $ EmploymentDuration.gt.7                <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration.Unemployed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Male.Divorced.Seperated       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Personal.Female.NotSingle              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Male.Single                   <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, …
## $ Personal.Male.Married.Widowed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Female.Single                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.None            <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors.CoApplicant     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.Guarantor       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Property.RealEstate                    <dbl> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Property.Insurance                     <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, …
## $ Property.CarOther                      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ Property.Unknown                       <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Bank             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Stores           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing.Rent                           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ Housing.Own                            <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, …
## $ Housing.ForFree                        <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ Job.UnemployedUnskilled                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job.UnskilledResident                  <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Job.SkilledEmployee                    <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, …
## $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, …

Preprocessing and Feature Engineering Using recipes
Artwork by @allison_horst

Next we will preprocess the data using recipes.

  • Our recipe will need a training data set and a formula, minimally.
  • One thing we may want to do is to consider the ordinal nature of EmploymentDuration.
    • First we need to make sure that the factors are ordered correctly.
    • Then we can use step_ordinalscore to convert an ordinal factor to a numeric score.
  • We can also convert our strings to factors
  • and center and scale our numerics.
  • Additional things that can be done, which we will skip here:
    • impute missing values
    • remove variables that are highly sparse and unbalanced
    • up or down sample unbalanced outcomes
    • filter the data using the same syntax as dplyr::filter

There are many more steps available; the reference docs are here.
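
As a rough sketch of what those steps could look like (this is not the recipe used below; it assumes EmploymentDuration is a single, correctly ordered factor and the other categorical predictors are character columns rather than the dummy-coded columns shown in the glimpse above):

# Hypothetical sketch only, under the assumptions described above.
recipe_sketch <- 
  training(credit_split) %>% 
  recipe(Class ~ .) %>% 
  # convert the ordered EmploymentDuration factor to a numeric score
  step_ordinalscore(EmploymentDuration) %>% 
  # convert the remaining character predictors to factors
  step_string2factor(all_nominal(), -all_outcomes()) %>% 
  # center and scale the numeric predictors
  step_center(all_numeric()) %>% 
  step_scale(all_numeric())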

When we have finished with our recipe we prep.

our_recipe <- 
  training(credit_split) %>% 
  recipe(Class ~ .) %>% 
  prep()
our_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         61
## 
## Training data contained 750 data points and no missing data.

Once our recipe is ready to go, it’s time to juice!

train <- our_recipe %>% 
  prep() %>% 
  juice()
### equivalent to:
# bake(our_recipe, training(credit_split))
## when prep(retain = TRUE) (the default)
## and no prep steps have skip = TRUE

As you can see, our training data now reflects whatever steps the recipe contains; with steps like the EmploymentDuration ordinal score added, those transformations would show up here as well.

glimpse(train)

The next thing I want to do is set up cross-validation to tune model parameters using my training data. We will go back to rsample for this.

set.seed(2134)
(cv_resamples <- 
  training(credit_split) %>% 
  vfold_cv(v = 10))
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits           id    
##    <named list>     <chr> 
##  1 <split [675/75]> Fold01
##  2 <split [675/75]> Fold02
##  3 <split [675/75]> Fold03
##  4 <split [675/75]> Fold04
##  5 <split [675/75]> Fold05
##  6 <split [675/75]> Fold06
##  7 <split [675/75]> Fold07
##  8 <split [675/75]> Fold08
##  9 <split [675/75]> Fold09
## 10 <split [675/75]> Fold10

Alternatively, we could use bootstrap resampling.

bt_resamples <- 
  training(credit_split) %>% 
  bootstraps(times = 10)

Setting our engines using parsnip
Artwork by @allison_horst

The parsnip package allows us to specify models using a unified syntax regardless of the syntax of the underlying engine. All of the available parsnip models and engines can be found here. The basic syntax for setting up a parsnip model is model(mode) %>% set_engine('engine'), like this:

logistic_reg(mode = 'classification') %>% 
  set_engine('glm')
rand_forest(mode = 'classification') %>% 
  set_engine('ranger')

The specific arguments that are available for a given model type are found in that model type’s documentation, e.g. ?rand_forest tells us we can set mtry, trees, and min_n.

Let’s begin by setting up some model objects:

# logistic regression
log_reg_mod <- 
  logistic_reg() %>%
  set_engine("glm") %>% 
  set_mode('classification')

# random forest
rf_mod <- rand_forest(
  trees = tune(),
  mtry = tune(),
  min_n = tune(), 
  mode = 'classification'
  ) %>%
  set_engine("ranger")

#k nearest neighbors
knn_mod <- 
  nearest_neighbor(neighbors = tune(), 
                   weight_func = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

# boosted tree
boost_trees <- 
  boost_tree(
  mode = "classification", 
  mtry = tune(), 
  trees = tune(), 
  min_n = tune(), 
  # only available for xgboost
  tree_depth = tune(), 
  learn_rate = tune()
    ) 
xboost_mod <- 
  boost_trees %>% 
  set_engine("xgboost")

c50_mod <- 
  boost_trees %>% 
  set_engine('C5.0')

Notice that we set two different engines for boosted trees.

Tuning with dials and tune
The dials and tune packages together allow us to tune our models using the cross-validation resamples we set up above. We use dials to specify our tuning parameters. For clarity I have explicitly specified the package namespace in places, even though dials and tune were loaded at the beginning.

(ctrl <- control_grid(verbose = TRUE))
## $verbose
## [1] TRUE
## 
## $allow_par
## [1] TRUE
## 
## $extract
## NULL
## 
## $save_pred
## [1] FALSE
## 
## $pkgs
## NULL
set.seed(2117)
(knn_grid <- knn_mod %>% 
  parameters() %>% 
  grid_regular(levels = c(15, 5)))
## # A tibble: 75 x 2
##    neighbors weight_func
##        <int> <chr>      
##  1         1 rectangular
##  2         2 rectangular
##  3         3 rectangular
##  4         4 rectangular
##  5         5 rectangular
##  6         6 rectangular
##  7         7 rectangular
##  8         8 rectangular
##  9         9 rectangular
## 10        10 rectangular
## # … with 65 more rows
knn_tune <- tune_grid(
  our_recipe, 
  model = knn_mod, 
  resamples = cv_resamples, 
  grid = knn_grid, 
  control = ctrl
)
## i Fold01: recipe
## ✓ Fold01: recipe
## i Fold01: model 1/5
## ✓ Fold01: model 1/5
## i Fold01: model 1/5 (predictions)
## i Fold01: model 2/5
## ✓ Fold01: model 2/5
## i Fold01: model 2/5 (predictions)
## i Fold01: model 3/5
## ✓ Fold01: model 3/5
## i Fold01: model 3/5 (predictions)
## i Fold01: model 4/5
## ✓ Fold01: model 4/5
## i Fold01: model 4/5 (predictions)
## i Fold01: model 5/5
## ✓ Fold01: model 5/5
## i Fold01: model 5/5 (predictions)
## (progress messages for Fold02 through Fold10 follow the same pattern)
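
The knn_tune object now holds the resampled metrics for every neighbors/weight_func combination. One quick way to look at the top candidates (a sketch; output not shown) is with show_best from tune:

# top knn parameter combinations ranked by ROC AUC
show_best(knn_tune, metric = "roc_auc")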

(rf_params <- 
  dials::parameters(dials::trees(), 
                    dials::min_n(), 
                    finalize(mtry(), select(GermanCredit, -Class))
                    ) %>% 
  dials::grid_latin_hypercube(size = 3))
## # A tibble: 3 x 3
##   trees min_n  mtry
##   <int> <int> <int>
## 1  1462    30    22
## 2   944     9    55
## 3   636    23     6
ctrl <- control_grid()
(rf_tune <- 
    tune::tune_grid(
      our_recipe, 
      model = rf_mod, 
      resamples = cv_resamples, 
      grid = rf_params, 
      control = ctrl
    ))
## i Creating pre-processing data to finalize unknown parameter: mtry
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits           id     .metrics         .notes          
##  * <list>           <chr>  <list>           <list>          
##  1 <split [675/75]> Fold01 <tibble [6 × 6]> <tibble [0 × 1]>
##  2 <split [675/75]> Fold02 <tibble [6 × 6]> <tibble [0 × 1]>
##  3 <split [675/75]> Fold03 <tibble [6 × 6]> <tibble [0 × 1]>
##  4 <split [675/75]> Fold04 <tibble [6 × 6]> <tibble [0 × 1]>
##  5 <split [675/75]> Fold05 <tibble [6 × 6]> <tibble [0 × 1]>
##  6 <split [675/75]> Fold06 <tibble [6 × 6]> <tibble [0 × 1]>
##  7 <split [675/75]> Fold07 <tibble [6 × 6]> <tibble [0 × 1]>
##  8 <split [675/75]> Fold08 <tibble [6 × 6]> <tibble [0 × 1]>
##  9 <split [675/75]> Fold09 <tibble [6 × 6]> <tibble [0 × 1]>
## 10 <split [675/75]> Fold10 <tibble [6 × 6]> <tibble [0 × 1]>
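
Before picking a final parameter combination, we can summarize the resampled metrics (a quick look; output not shown):

# average performance across the 10 folds for each trees/min_n/mtry combination
collect_metrics(rf_tune)
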
best_rf <-
  select_best(rf_tune, metric = "roc_auc")
best_rf
## # A tibble: 1 x 3
##    mtry trees min_n
##   <int> <int> <int>
## 1    55   944     9
rf_mod_final <- finalize_model(rf_mod, best_rf)
our_rec_final <- prep(our_recipe)
(credit_wfl <- 
  workflow() %>% 
  add_recipe(our_rec_final) %>% 
  add_model(log_reg_mod))

log_reg_fit <- 
  fit(credit_wfl, data = train)

rf_fit <- 
  credit_wfl %>% 
  update_model(rf_mod_final) %>% 
  fit(data = train)
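
From here we could evaluate a fitted workflow on the held-out test set with yardstick. This is just a sketch, assuming Bad (the first level of Class) is the event of interest:

# Sketch: predictions and metrics for the logistic regression workflow on the test set
test_preds <- 
  testing(credit_split) %>% 
  select(Class) %>% 
  bind_cols(predict(log_reg_fit, new_data = testing(credit_split))) %>% 
  bind_cols(predict(log_reg_fit, new_data = testing(credit_split), type = "prob"))

test_preds %>% accuracy(truth = Class, estimate = .pred_class)
test_preds %>% roc_auc(truth = Class, .pred_Bad)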

  1. Alternatively, the original data can be downloaded here, but I wanted tidy variable names and factors without an overly long post. The code for converting to factors can be found in a previous post.