Exciting - with the tune package, @topepos says the tidymodels suite of packages is now “a fully operational death star” for modeling in #rstats. Check them out! https://t.co/nCdA1Fayvv
— Emily Robinson (@robinson_es) December 12, 2019
I have been trying to incorporate more of the tidymodels suite of packages
into my predictive modeling workflow (rsample, recipes), but I find myself
frequently falling back on caret for model tuning and fitting
because it’s what I know. This post is a walkthrough from start to finish using
the tidymodels suite. This post is NOT a tutorial for supervised learning.
It often makes choices to better illustrate package functionality
rather than to develop the best (or even a good) predictive model.
It assumes you know how and why to split your data, resample, up/down sample,
tune parameters, evaluate models etc.
If you are looking for guides for machine learning, I highly recommend:
⭐ Learning to teach machines to learn: This post from Alison Hill is FULL of great resources.
📕 Applied Predictive Modeling by Max Kuhn and Kjell Johnson
📕 Intro to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
📕 The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman
——
Let’s get started:
From the tidymodels readme:
tidymodels is a “meta-package” for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse
The Data
For this post I am going to use the German Credit data
from the University of California, Irvine Machine Learning Repository. This
data set is included in the caret package, but the categorical variables are
dummy coded and I want to demonstrate some of the factor options in recipes, so
I am converting the dummy variables to factors [1].
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
data(GermanCredit, package = "caret")
glimpse(GermanCredit)
## Observations: 1,000
## Variables: 62
## $ Duration <int> 6, 48, 12, 42, 24, 36, 24, 36,…
## $ Amount <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
## $ ResidenceDuration <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
## $ Age <int> 67, 22, 49, 45, 53, 35, 53, 35…
## $ NumberExistingCredits <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ NumberPeopleMaintenance <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
## $ Telephone <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ ForeignWorker <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus.lt.0 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.0.to.200 <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ CheckingAccountStatus.gt.200 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.none <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
## $ CreditHistory.NoCredit.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.ThisBank.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.PaidDuly <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
## $ CreditHistory.Delay <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory.Critical <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ Purpose.NewCar <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ Purpose.UsedCar <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Purpose.Furniture.Equipment <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Purpose.Radio.Television <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Purpose.DomesticAppliance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Repairs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Education <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, …
## $ Purpose.Vacation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Retraining <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Business <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Other <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.lt.100 <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
## $ SavingsAccountBonds.100.to.500 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.500.to.1000 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ SavingsAccountBonds.gt.1000 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ SavingsAccountBonds.Unknown <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration.lt.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration.1.to.4 <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
## $ EmploymentDuration.4.to.7 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ EmploymentDuration.gt.7 <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ EmploymentDuration.Unemployed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Male.Divorced.Seperated <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Female.NotSingle <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Personal.Male.Single <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ Personal.Male.Married.Widowed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Female.Single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.None <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors.CoApplicant <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.Guarantor <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Property.RealEstate <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ Property.Insurance <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ Property.CarOther <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ Property.Unknown <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Bank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Stores <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing.Rent <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Housing.Own <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ Housing.ForFree <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ Job.UnemployedUnskilled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job.UnskilledResident <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
## $ Job.SkilledEmployee <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
## $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
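To make the idea of converting dummy variables concrete, here is a minimal base R sketch that collapses a set of one-hot columns back into a single factor. It is not the code from the previous post; the toy data frame only mimics the Housing.* columns shown above, and its values are made up for illustration.

```r
# Toy one-hot columns mimicking the Housing.* dummies (values are made up)
dummies <- data.frame(
  Housing.Rent    = c(0, 0, 1),
  Housing.Own     = c(1, 0, 0),
  Housing.ForFree = c(0, 1, 0)
)

# Level names are the column names with the shared prefix removed
level_names <- sub("^Housing\\.", "", names(dummies))

# For each row, find the column that holds the 1
hit <- max.col(as.matrix(dummies), ties.method = "first")

# Build the factor, keeping the original column order as the level order
housing <- factor(level_names[hit], levels = level_names)
housing
```

The same pattern could be repeated for each group of dummy columns that shares a prefix.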
Training Data and Testing Data using rsample
The first thing I want to do is split my data into a training and testing set.
rsample has a function, initial_split, that allows us to specify the proportion
to be used for training (the default is 0.75) and a strata variable for
stratified sampling (this allows us to ensure relative balance of our outcome
between training and testing).
A seed should always be set to ensure reproducibility.
The training and testing functions then allow us to access the respective data.
set.seed(1450)
credit_split <- GermanCredit %>%
  initial_split(prop = 0.75, strata = Class)
credit_split
## <750/250/1000>
get_prop <- function(data, variable){
  data %>%
    count({{variable}}) %>%
    mutate(pct = n/sum(n))
}
map_dfr(list(training = training(credit_split),
             testing = testing(credit_split)),
        get_prop, variable = Class,
        .id = 'source')
## # A tibble: 4 x 4
## source Class n pct
## <chr> <fct> <int> <dbl>
## 1 training Bad 225 0.3
## 2 training Good 525 0.7
## 3 testing Bad 75 0.3
## 4 testing Good 175 0.7
If we don’t use stratified sampling for the outcome, we are likely to have some imbalance between training and testing, as seen below.
set.seed(1450)
credit_split_no_strata <- GermanCredit %>%
  initial_split(prop = 0.75)
credit_split_no_strata
## <750/250/1000>
map_dfr(list(training = training(credit_split_no_strata),
             testing = testing(credit_split_no_strata)),
        get_prop, variable = Class,
        .id = 'source')
## # A tibble: 4 x 4
## source Class n pct
## <chr> <fct> <int> <dbl>
## 1 training Bad 215 0.287
## 2 training Good 535 0.713
## 3 testing Bad 85 0.34
## 4 testing Good 165 0.66
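To make the idea of stratification concrete, here is a minimal base R sketch that samples 75% of the rows within each outcome class separately. This only illustrates the concept and is not how rsample implements it; the outcome vector is simulated to have the same 300/700 class balance as the credit data.

```r
set.seed(1)

# Simulated outcome with the same class balance as the credit data
y <- factor(rep(c("Bad", "Good"), times = c(300, 700)))

# Sample 75% of the row indices within each class, then combine them
train_idx <- unlist(lapply(
  split(seq_along(y), y),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

# Class proportions are identical in both partitions
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```

Because the sampling happens inside each class, the class proportions in the two partitions cannot drift apart the way they do in the unstratified split.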
We will move forward with the training data from the stratified sampling.
training(credit_split) %>% glimpse
## Observations: 750
## Variables: 62
## $ Duration <int> 6, 48, 12, 42, 24, 24, 36, 12,…
## $ Amount <int> 1169, 5951, 2096, 7882, 4870, …
## $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 3, 2, 2, 4, 3, …
## $ ResidenceDuration <int> 4, 2, 3, 4, 4, 4, 2, 4, 2, 1, …
## $ Age <int> 67, 22, 49, 45, 53, 53, 35, 61…
## $ NumberExistingCredits <int> 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, …
## $ NumberPeopleMaintenance <int> 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, …
## $ Telephone <dbl> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
## $ ForeignWorker <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Class <fct> Good, Bad, Good, Good, Bad, Go…
## $ CheckingAccountStatus.lt.0 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.0.to.200 <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ CheckingAccountStatus.gt.200 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CheckingAccountStatus.none <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, …
## $ CreditHistory.NoCredit.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.ThisBank.AllPaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CreditHistory.PaidDuly <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, …
## $ CreditHistory.Delay <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ CreditHistory.Critical <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ Purpose.NewCar <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, …
## $ Purpose.UsedCar <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ Purpose.Furniture.Equipment <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, …
## $ Purpose.Radio.Television <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Purpose.DomesticAppliance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Repairs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Education <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Vacation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Retraining <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Business <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Purpose.Other <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.lt.100 <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
## $ SavingsAccountBonds.100.to.500 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ SavingsAccountBonds.500.to.1000 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ SavingsAccountBonds.gt.1000 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ SavingsAccountBonds.Unknown <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ EmploymentDuration.lt.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ EmploymentDuration.1.to.4 <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, …
## $ EmploymentDuration.4.to.7 <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
## $ EmploymentDuration.gt.7 <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ EmploymentDuration.Unemployed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Male.Divorced.Seperated <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Personal.Female.NotSingle <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ Personal.Male.Single <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, …
## $ Personal.Male.Married.Widowed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Personal.Female.Single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.None <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ OtherDebtorsGuarantors.CoApplicant <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherDebtorsGuarantors.Guarantor <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Property.RealEstate <dbl> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Property.Insurance <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, …
## $ Property.CarOther <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ Property.Unknown <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Bank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.Stores <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ OtherInstallmentPlans.None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Housing.Rent <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ Housing.Own <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, …
## $ Housing.ForFree <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ Job.UnemployedUnskilled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Job.UnskilledResident <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Job.SkilledEmployee <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, …
## $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
Preprocessing and Feature Engineering Using recipes
“Artwork by ‘@allison_horst’”
Next we will preprocess the data using recipes.
- Our recipe will need a training data set and a formula, minimally.
- One thing we may want to do is to consider the ordinal nature of EmploymentDuration.
  - First we need to make sure that the factors are ordered correctly.
  - Then we can use step_ordinalscore to convert an ordinal factor to a numeric score.
- We can also convert our strings to factors
- and center and scale our numerics.
- Additional things that can be done, which we will skip:
  - impute missing values
  - remove variables that are highly sparse and unbalanced
  - up or down sample unbalanced outcomes
  - filter the data using the same syntax as filter
There are many more steps available; the reference docs are here.
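As a rough illustration of the steps listed above, the sketch below builds a recipe on a tiny made-up data set. The toy data frame, its level names, and the choice of step_center and step_scale for centering and scaling are assumptions for illustration, not the recipe used in this post.

```r
library(recipes)

# Tiny made-up data set; the level names are hypothetical
toy <- data.frame(
  Class = factor(c("Good", "Bad", "Good", "Bad")),
  EmploymentDuration = c("lt.1", "1.to.4", "gt.7", "4.to.7"),
  Amount = c(1000, 2000, 3000, 4000),
  stringsAsFactors = FALSE
)

# Make sure the factor levels are ordered correctly before the recipe sees them
toy$EmploymentDuration <- factor(
  toy$EmploymentDuration,
  levels = c("lt.1", "1.to.4", "4.to.7", "gt.7"),
  ordered = TRUE
)

toy_recipe <- recipe(Class ~ ., data = toy) %>%
  step_ordinalscore(EmploymentDuration) %>%  # ordered factor -> numeric score
  step_center(Amount) %>%                    # subtract the training mean
  step_scale(Amount) %>%                     # divide by the training sd
  prep()

juice(toy_recipe)
```

After prepping, EmploymentDuration holds the position of each level in the ordering, and Amount has mean 0 and standard deviation 1.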
When we have finished specifying our recipe, we prep it.
our_recipe <-
  training(credit_split) %>%
  recipe(Class ~ .) %>%
  prep()
our_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 61
##
## Training data contained 750 data points and no missing data.
Once our recipe is ready to go, it’s time to juice!
train <- our_recipe %>%
  prep() %>%
  juice()
### equivalent to:
# bake(our_recipe, training(credit_split))
## when prep(retain = TRUE) (the default)
## and no recipe steps have skip = TRUE
As you can see, our training data is now updated with the recipe steps,
including the conversion of EmploymentDuration to an ordinal score.
glimpse(train)
The next thing I want to do is set up cross-validation to tune model parameters
using my training data. We will go back to rsample for this.
set.seed(2134)
(cv_resamples <-
  training(credit_split) %>%
  vfold_cv(v = 10))
## # 10-fold cross-validation
## # A tibble: 10 x 2
## splits id
## <named list> <chr>
## 1 <split [675/75]> Fold01
## 2 <split [675/75]> Fold02
## 3 <split [675/75]> Fold03
## 4 <split [675/75]> Fold04
## 5 <split [675/75]> Fold05
## 6 <split [675/75]> Fold06
## 7 <split [675/75]> Fold07
## 8 <split [675/75]> Fold08
## 9 <split [675/75]> Fold09
## 10 <split [675/75]> Fold10
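The 675/75 sizes printed above follow directly from splitting 750 training rows into 10 folds. The base R sketch below shows the bookkeeping; it illustrates the idea of v-fold cross-validation rather than rsample's internals.

```r
set.seed(2134)
n <- 750  # number of rows in the training data
v <- 10   # number of folds

# Randomly assign each row to one of v equally sized folds
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold k, rows in fold k are held out (assessment)
# and all remaining rows are used for fitting (analysis)
sizes <- t(sapply(seq_len(v), function(k) {
  c(analysis = sum(fold_id != k), assessment = sum(fold_id == k))
}))
sizes
```

Every row is used for assessment exactly once and for fitting in the other nine folds.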
Alternatively, we could also use a bootstrap.
bt_resamples <-
  training(credit_split) %>%
  bootstraps(times = 10)
Setting our engines using parsnip
“Artwork by ‘@allison_horst’”
Parsnip allows us to specify models using a unified syntax regardless of the
syntax of the underlying engine.
All of the available parsnip models and engines can be found
here.
The basic syntax for setting up a parsnip model is
model(mode) %>% set_engine(), like this:
logistic_reg(mode = 'classification') %>%
  set_engine("glm")
rand_forest(mode = 'classification') %>%
  set_engine("ranger")
The specific arguments that are available for a given model type are found in
the model type’s documentation, e.g. ?rand_forest tells us we can set mtry,
trees, and min_n.
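As a small end-to-end illustration of the unified syntax, the sketch below specifies a logistic regression with the glm engine, fits it, and predicts. It uses the built-in mtcars data rather than the credit data, and the choice of am as the outcome and wt as the predictor is arbitrary.

```r
library(parsnip)

# Built-in data with a two-level factor outcome (an arbitrary choice)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Specify the model type and mode, then choose the engine
spec <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

# Fit with a formula; the returned object wraps the underlying glm fit
fitted <- fit(spec, am ~ wt, data = cars)

# Predictions come back as a tibble with a .pred_class column
preds <- predict(fitted, new_data = cars[1:3, ])
preds
```

Swapping the engine means changing only the set_engine() call (and any engine-specific arguments); the fit and predict calls stay the same.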
Let’s begin by setting up some model objects:
# logistic regression
log_reg_mod <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode('classification')
# random forest
rf_mod <- rand_forest(
  trees = tune(),
  mtry = tune(),
  min_n = tune(),
  mode = 'classification'
) %>%
  set_engine("ranger")
# k nearest neighbors
knn_mod <-
  nearest_neighbor(neighbors = tune(),
                   weight_func = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# boosted tree
boost_trees <-
  boost_tree(
    mode = "classification",
    mtry = tune(),
    trees = tune(),
    min_n = tune(),
    # only available for xgboost
    tree_depth = tune(),
    learn_rate = tune()
  )
xboost_mod <-
  boost_trees %>%
  set_engine("xgboost")
c50_mod <-
  boost_trees %>%
  set_engine('C5.0')
Notice that we set two different engines for boosted trees.
Tuning with dials and tune
The dials and tune packages together allow us to tune our models using the
cross-validation resample we set up above. We use dials to specify our tuning
parameters. For clarity I have explicitly specified the package namespace,
even though dials and tune were loaded at the beginning.
(ctrl <- control_grid(verbose = TRUE))
## $verbose
## [1] TRUE
##
## $allow_par
## [1] TRUE
##
## $extract
## NULL
##
## $save_pred
## [1] FALSE
##
## $pkgs
## NULL
set.seed(2117)
(knn_grid <- knn_mod %>%
  parameters() %>%
  grid_regular(levels = c(15, 5)))
## # A tibble: 75 x 2
## neighbors weight_func
## <int> <chr>
## 1 1 rectangular
## 2 2 rectangular
## 3 3 rectangular
## 4 4 rectangular
## 5 5 rectangular
## 6 6 rectangular
## 7 7 rectangular
## 8 8 rectangular
## 9 9 rectangular
## 10 10 rectangular
## # … with 65 more rows
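The 75 rows in the grid above are the full crossing of 15 values of neighbors with 5 values of weight_func. A base R sketch of the same idea follows; only "rectangular" appears in the printed output, so the other four weight function names are assumptions for illustration.

```r
# Full factorial grid: every neighbors value paired with every weight function.
# Only "rectangular" is confirmed by the output above; the rest are assumed.
toy_grid <- expand.grid(
  neighbors = 1:15,
  weight_func = c("rectangular", "triangular", "epanechnikov",
                  "biweight", "triweight"),
  stringsAsFactors = FALSE
)

nrow(toy_grid)  # 15 * 5 combinations
head(toy_grid)
```

A regular grid grows multiplicatively with each tuning parameter, which is one reason space-filling designs such as a Latin hypercube are attractive when there are many parameters.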
knn_tune <- tune_grid(
  our_recipe,
  model = knn_mod,
  resamples = cv_resamples,
  grid = knn_grid,
  control = ctrl
)
## i Fold01: recipe
## ✓ Fold01: recipe
## i Fold01: model 1/5
## ✓ Fold01: model 1/5
## i Fold01: model 1/5 (predictions)
## i Fold01: model 2/5
## ✓ Fold01: model 2/5
## i Fold01: model 2/5 (predictions)
## i Fold01: model 3/5
## ✓ Fold01: model 3/5
## i Fold01: model 3/5 (predictions)
## i Fold01: model 4/5
## ✓ Fold01: model 4/5
## i Fold01: model 4/5 (predictions)
## i Fold01: model 5/5
## ✓ Fold01: model 5/5
## i Fold01: model 5/5 (predictions)
## … (the same recipe, model, and prediction messages repeat for Fold02 through Fold10)
(rf_params <-
  dials::parameters(
    dials::trees(),
    dials::min_n(),
    finalize(mtry(), select(GermanCredit, -Class))
  ) %>%
  dials::grid_latin_hypercube(size = 3))
## # A tibble: 3 x 3
## trees min_n mtry
## <int> <int> <int>
## 1 1462 30 22
## 2 944 9 55
## 3 636 23 6
ctrl <- control_grid()
(rf_tune <-
  tune::tune_grid(
    our_recipe,
    model = rf_mod,
    resamples = cv_resamples,
    grid = rf_params,
    control = ctrl
  ))
## i Creating pre-processing data to finalize unknown parameter: mtry
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## * <list> <chr> <list> <list>
## 1 <split [675/75]> Fold01 <tibble [6 × 6]> <tibble [0 × 1]>
## 2 <split [675/75]> Fold02 <tibble [6 × 6]> <tibble [0 × 1]>
## 3 <split [675/75]> Fold03 <tibble [6 × 6]> <tibble [0 × 1]>
## 4 <split [675/75]> Fold04 <tibble [6 × 6]> <tibble [0 × 1]>
## 5 <split [675/75]> Fold05 <tibble [6 × 6]> <tibble [0 × 1]>
## 6 <split [675/75]> Fold06 <tibble [6 × 6]> <tibble [0 × 1]>
## 7 <split [675/75]> Fold07 <tibble [6 × 6]> <tibble [0 × 1]>
## 8 <split [675/75]> Fold08 <tibble [6 × 6]> <tibble [0 × 1]>
## 9 <split [675/75]> Fold09 <tibble [6 × 6]> <tibble [0 × 1]>
## 10 <split [675/75]> Fold10 <tibble [6 × 6]> <tibble [0 × 1]>
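# NOTE: roc_auc is a higher-is-better metric, so maximize = FALSE selects the
# candidate with the lowest AUC; use maximize = TRUE to select the highest.
best_rf <-
  select_best(rf_tune, metric = "roc_auc", maximize = FALSE)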
best_rf
## # A tibble: 1 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 55 944 9
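Conceptually, selecting the best candidate means averaging the metric over the resamples for each parameter combination and then taking the extreme value. The base R sketch below shows that logic on hypothetical per-fold AUC values; the numbers are made up and are not results from this post.

```r
# Hypothetical per-fold AUC values for two candidate mtry settings (made up)
res <- data.frame(
  mtry    = rep(c(6, 22), each = 3),
  roc_auc = c(0.70, 0.72, 0.74, 0.78, 0.76, 0.80)
)

# Average the metric over folds for each candidate
avg <- aggregate(roc_auc ~ mtry, data = res, FUN = mean)

# For a higher-is-better metric such as AUC, the best candidate is the maximum
best <- avg[which.max(avg$roc_auc), ]
best
```

For a lower-is-better metric such as RMSE, which.min would be the appropriate choice instead.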
rf_mod_final <- finalize_model(rf_mod, best_rf)
our_rec_final <- prep(our_recipe)
(credit_wfl <-
  workflow() %>%
  add_recipe(our_rec_final) %>%
  add_model(log_reg_mod))
log_reg_fit <-
  fit(credit_wfl, data = train)
# use the finalized random forest spec; rf_mod still contains tune() placeholders
rf_fit <-
  credit_wfl %>%
  update_model(rf_mod_final) %>%
  fit(data = train)
[1] Alternatively, the original data can be downloaded here, but I wanted tidy variable names and factors without an overly long post. The code for converting to factors can be found in a previous post.