To evaluate this model, we will use 10 repeats of 10-fold cross-validation and use the 100 holdout samples to evaluate the overall accuracy of the model.

First, let's make the splits of the data:
-```{r model_vfold, message=FALSE}
+```{r}
+#| label: model_vfold
+#| message: false
library(rsample)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 10, repeats = 10)
@@ -77,7 +87,8 @@ Now let's write a function that will, for each resample:

Here is our function:

-```{r lm_func}
+```{r}
+#| label: lm_func
## splits will be the `rsplit` object with the 90/10 partition
@@ -134,15 +150,20 @@ Traditionally, the bootstrap has been primarily used to empirically determine th

For example, are there differences in the median monthly income between genders?

-```{r type_plot, fig.alt = "Two boxplots of monthly income separated by gender, showing a slight difference in median but largely overlapping boxes."}
+```{r}
+#| label: type_plot
+#| fig.alt: >
+#|   Two boxplots of monthly income separated by gender, showing a slight
+#|   difference in median but largely overlapping boxes.
ggplot(attrition, aes(x = Gender, y = MonthlyIncome)) +
  geom_boxplot() +
  scale_y_log10()
```

If we wanted to compare the genders, we could conduct a _t_-test or rank-based test. Instead, let's use the bootstrap to see if there is a difference in the median incomes for the two groups. We need a simple function to compute this statistic on the resample:
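
The vignette's own function is not part of this hunk, so the following is only a minimal sketch of what such a statistic function could look like. It assumes the built-in `mtcars` data in place of `attrition`, with `am` standing in for `Gender` and `mpg` for `MonthlyIncome`; the names `median_diff_sketch` and `bt_sketch` are hypothetical.

```r
library(rsample)

## Hypothetical helper: difference in group medians on the analysis set.
## `mtcars`, `am`, and `mpg` are stand-ins for `attrition`, `Gender`,
## and `MonthlyIncome`.
median_diff_sketch <- function(splits) {
  x <- analysis(splits)
  median(x$mpg[x$am == 1]) - median(x$mpg[x$am == 0])
}

set.seed(353)
bt_sketch <- bootstraps(mtcars, times = 50)

## One statistic per bootstrap resample
bt_sketch$mpg_diff <- vapply(bt_sketch$splits, median_diff_sketch, numeric(1))
head(bt_sketch$mpg_diff)
```

The function only touches the analysis set, since the bootstrap statistic is computed on the resampled data rather than on the held-out rows.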
The bootstrap distribution of this statistic has a slightly bimodal and skewed distribution:

-```{r stats_plot, fig.alt = "The bootstrap distribution of the differences in median monthly income: it is slightly bimodal and left-skewed."}
+```{r}
+#| label: stats_plot
+#| fig.alt: >
+#|   The bootstrap distribution of the differences in median monthly income:
+#|   it is slightly bimodal and left-skewed.
ggplot(bt_resamples, aes(x = wage_diff)) +
  geom_line(stat = "density", adjust = 1.25) +
  xlab("Difference in Median Monthly Income (Female - Male)")
```

The variation is considerable in this statistic. One method of computing a confidence interval is to take the percentiles of the bootstrap distribution. A 95% confidence interval for the difference in the medians would be:

-```{r ci}
+```{r}
+#| label: ci
quantile(bt_resamples$wage_diff,
         probs = c(0.025, 0.975))
```
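
For readers without the `attrition` data at hand, the percentile approach can be sketched end to end on `mtcars`. This is an illustration under assumptions, not the vignette's code: `am` and `mpg` are stand-in variables, and `bt_ci_sketch`, `diffs`, and `ci` are hypothetical names.

```r
library(rsample)

set.seed(1252)
bt_ci_sketch <- bootstraps(mtcars, times = 200)

## Difference in median mpg (manual minus automatic) on each analysis set
diffs <- vapply(
  bt_ci_sketch$splits,
  function(s) {
    x <- analysis(s)
    median(x$mpg[x$am == 1]) - median(x$mpg[x$am == 0])
  },
  numeric(1)
)

## Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap statistics
ci <- quantile(diffs, probs = c(0.025, 0.975))
ci

## TRUE if the interval covers zero
unname(ci[1] <= 0 && ci[2] >= 0)
```

An interval that covers zero would be read the same way as in the text: no evidence of a difference between the two groups.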
@@ -184,7 +212,8 @@ The calculated 95% confidence interval contains zero, so we don't have evidence

Unless there is already a column in the resample object that contains the fitted model, a function can be used to fit the model and save all of the model coefficients. The [broom package](https://cran.r-project.org/package=broom) has a `tidy()` function that will save the coefficients in a data frame. Instead of returning a data frame with a row for each model term, we will save a data frame with a single row and columns for each model term. As before, `purrr::map()` can be used to estimate and save these values for each split.

-```{r coefs}
+```{r}
+#| label: coefs
glm_coefs <- function(splits, ...) {
  ## use `analysis` or `as.data.frame` to get the analysis data
  mod <- glm(..., data = analysis(splits), family = binomial)
@@ -201,15 +230,17 @@ bt_resamples$betas[[1]]

As previously mentioned, the [broom package](https://cran.r-project.org/package=broom) contains a generic function called `tidy()` that creates representations of objects that can be easily used for analysis, plotting, etc. rsample contains `tidy` methods for `rset` and `rsplit` objects. For example:
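
The example that follows in the vignette is cut off in this diff. As a hedged sketch of how the `tidy` methods can be called, the code below uses a small bootstrap of `mtcars`. The generic is called as `generics::tidy()` on the assumption that the generics package is installed alongside rsample; `bt_tidy_sketch`, `split_tbl`, and `rset_tbl` are hypothetical names.

```r
library(rsample)

set.seed(8584)
bt_tidy_sketch <- bootstraps(mtcars, times = 3)

## Tidy a single rsplit: rows of the original data labelled by partition
split_tbl <- generics::tidy(bt_tidy_sketch$splits[[1]])
head(split_tbl)

## Tidy the whole rset: the same information stacked across resamples
rset_tbl <- generics::tidy(bt_tidy_sketch)
head(rset_tbl)
```

The resulting data frames are in a long format that is convenient for plotting which rows fall in the analysis or assessment set of each resample.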
vignettes/rsample.Rmd (+10, -4)
@@ -8,7 +8,9 @@ output:
    toc: yes
---

-```{r ex_setup, include=FALSE}
+```{r}
+#| label: ex_setup
+#| include: false
knitr::opts_chunk$set(
  message = FALSE,
  digits = 3,
@@ -28,7 +30,9 @@ The main class in the package (`rset`) is for a _set_ or _collection_ of resampl

Like [modelr](https://cran.r-project.org/package=modelr), the resamples are stored in a data-frame-like `tibble` object. As a simple example, here is a small set of bootstraps of the `mtcars` data:

-```{r mtcars_bt, message=FALSE}
+```{r}
+#| label: mtcars_bt
+#| message: false
library(rsample)
set.seed(8584)
bt_resamples <- bootstraps(mtcars, times = 3)
@@ -48,14 +52,16 @@ In this package we use the following terminology for the two partitions that com
(Aside: While some might use the terms "training" and "testing" for these data sets, we avoid them since those labels often conflict with the data that result from an initial partition of the data that is typically done _before_ resampling. The training/test split can be conducted using the `initial_split()` function in this package.)

Let's look at one of the `rsplit` objects:
-```{r rsplit}
+```{r}
+#| label: rsplit
first_resample <- bt_resamples$splits[[1]]
first_resample
```
This indicates that there were `r dim(bt_resamples$splits[[1]])["analysis"]` data points in the analysis set, `r dim(bt_resamples$splits[[1]])["assessment"]` instances were in the assessment set, and that the original data contained `r dim(bt_resamples$splits[[1]])["n"]` data points. These results can also be determined using the `dim` function on an `rsplit` object.

To obtain either of these data sets from an `rsplit`, the `as.data.frame()` function can be used. By default, the analysis set is returned but the `data` option can be used to return the assessment data:
-```{r rsplit_df}
+```{r}
+#| label: rsplit_df
head(as.data.frame(first_resample))
as.data.frame(first_resample, data = "assessment")