Commit d9d23dd

prep for quarto
1 parent 11d419e commit d9d23dd

File tree

4 files changed, +65 -24 lines changed

vignettes/.gitignore

+2
@@ -1,2 +1,4 @@
 *.html
 *.R
+
+/.quarto/

vignettes/Common_Patterns.Rmd

+4-2
@@ -7,7 +7,8 @@ vignette: >
   %\VignetteEncoding{UTF-8}
 ---

-```{r, include = FALSE}
+```{r}
+#| include: false
 knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>",
@@ -19,7 +20,8 @@ The rsample package provides a number of resampling methods which are broadly ap

 Let's go ahead and load rsample now:

-```{r setup}
+```{r}
+#| label: setup
 library(rsample)
 ```

vignettes/Working_with_rsets.Rmd

+49-18
@@ -8,7 +8,9 @@ output:
     toc: yes
 ---

-```{r ex_setup, include=FALSE}
+```{r}
+#| label: ex_setup
+#| include: false
 knitr::opts_chunk$set(
   message = FALSE,
   digits = 3,
@@ -19,7 +21,9 @@ knitr::opts_chunk$set(
 options(digits = 3, width = 90)
 ```

-```{r ggplot2_setup, include = FALSE}
+```{r}
+#| label: ggplot2_setup
+#| include: false
 library(ggplot2)
 theme_set(theme_bw())
 ```
@@ -35,7 +39,9 @@ Let's use the `attrition` data set. From its documentation:

 The data can be accessed using

-```{r attrition, message=FALSE}
+```{r}
+#| label: attrition
+#| message: false
 library(rsample)
 data("attrition", package = "modeldata")
 names(attrition)
@@ -55,14 +61,18 @@ glm(Attrition ~ JobSatisfaction + Gender + MonthlyIncome,

 For convenience, we'll create a formula object that will be used later:

-```{r form, message=FALSE}
+```{r}
+#| label: form
+#| message: false
 mod_form <- as.formula(Attrition ~ JobSatisfaction + Gender + MonthlyIncome)
 ```

 To evaluate this model, we will use 10 repeats of 10-fold cross-validation and use the 100 holdout samples to evaluate the overall accuracy of the model.

 First, let's make the splits of the data:
-```{r model_vfold, message=FALSE}
+```{r}
+#| label: model_vfold
+#| message: false
 library(rsample)
 set.seed(4622)
 rs_obj <- vfold_cv(attrition, v = 10, repeats = 10)
@@ -77,7 +87,8 @@ Now let's write a function that will, for each resample:

 Here is our function:

-```{r lm_func}
+```{r}
+#| label: lm_func
 ## splits will be the `rsplit` object with the 90/10 partition
 holdout_results <- function(splits, ...) {
   # Fit the model to the 90%
@@ -99,7 +110,9 @@ holdout_results <- function(splits, ...) {

 For example:

-```{r onefold, warning = FALSE}
+```{r}
+#| label: onefold
+#| warning: false
 example <- holdout_results(rs_obj$splits[[1]], mod_form)
 dim(example)
 dim(assessment(rs_obj$splits[[1]]))
@@ -111,7 +124,9 @@ For this model, the `.fitted` value is the linear predictor in log-odds units.

 To compute this data set for each of the 100 resamples, we'll use the `map()` function from the purrr package:

-```{r model_purrr, warning=FALSE}
+```{r}
+#| label: model_purrr
+#| warning: false
 library(purrr)
 rs_obj$results <- map(rs_obj$splits,
                       holdout_results,
@@ -121,7 +136,8 @@ rs_obj

 Now we can compute the accuracy values for all of the assessment data sets:

-```{r model_acc}
+```{r}
+#| label: model_acc
 rs_obj$accuracy <- map_dbl(rs_obj$results, function(x) mean(x$correct))
 summary(rs_obj$accuracy)
 ```
@@ -134,15 +150,20 @@ Traditionally, the bootstrap has been primarily used to empirically determine th

 For example, are there differences in the median monthly income between genders?

-```{r type_plot, fig.alt = "Two boxplots of monthly income separated by gender, showing a slight difference in median but largely overlapping boxes."}
+```{r}
+#| label: type_plot
+#| fig.alt: >
+#|   Two boxplots of monthly income separated by gender, showing a slight
+#|   difference in median but largely overlapping boxes.
 ggplot(attrition, aes(x = Gender, y = MonthlyIncome)) +
   geom_boxplot() +
   scale_y_log10()
 ```

 If we wanted to compare the genders, we could conduct a _t_-test or rank-based test. Instead, let's use the bootstrap to see if there is a difference in the median incomes for the two groups. We need a simple function to compute this statistic on the resample:

-```{r mean_diff}
+```{r}
+#| label: mean_diff
 median_diff <- function(splits) {
   x <- analysis(splits)
   median(x$MonthlyIncome[x$Gender == "Female"]) -
@@ -152,28 +173,35 @@ median_diff <- function(splits) {

 Now we would create a large number of bootstrap samples (say 2000+). For illustration, we'll only do 500 in this document.

-```{r boot_mean_diff}
+```{r}
+#| label: boot_mean_diff
 set.seed(353)
 bt_resamples <- bootstraps(attrition, times = 500)
 ```

 This function is then computed across each resample:

-```{r stats}
+```{r}
+#| label: stats
 bt_resamples$wage_diff <- map_dbl(bt_resamples$splits, median_diff)
 ```

 The bootstrap distribution of this statistic has a slightly bimodal and skewed distribution:

-```{r stats_plot, fig.alt = "The bootstrap distribution of the differences in median monthly income: it is slightly bimodal and left-skewed."}
+```{r}
+#| label: stats_plot
+#| fig.alt: >
+#|   The bootstrap distribution of the differences in median monthly income:
+#|   it is slightly bimodal and left-skewed.
 ggplot(bt_resamples, aes(x = wage_diff)) +
   geom_line(stat = "density", adjust = 1.25) +
   xlab("Difference in Median Monthly Income (Female - Male)")
 ```

 The variation is considerable in this statistic. One method of computing a confidence interval is to take the percentiles of the bootstrap distribution. A 95% confidence interval for the difference in the means would be:

-```{r ci}
+```{r}
+#| label: ci
 quantile(bt_resamples$wage_diff,
          probs = c(0.025, 0.975))
 ```
@@ -184,7 +212,8 @@ The calculated 95% confidence interval contains zero, so we don't have evidence

 Unless there is already a column in the resample object that contains the fitted model, a function can be used to fit the model and save all of the model coefficients. The [broom package](https://cran.r-project.org/package=broom) package has a `tidy()` function that will save the coefficients in a data frame. Instead of returning a data frame with a row for each model term, we will save a data frame with a single row and columns for each model term. As before, `purrr::map()` can be used to estimate and save these values for each split.

-```{r coefs}
+```{r}
+#| label: coefs
 glm_coefs <- function(splits, ...) {
   ## use `analysis` or `as.data.frame` to get the analysis data
   mod <- glm(..., data = analysis(splits), family = binomial)
@@ -201,15 +230,17 @@ bt_resamples$betas[[1]]

 As previously mentioned, the [broom package](https://cran.r-project.org/package=broom) contains a class called `tidy` that created representations of objects that can be easily used for analysis, plotting, etc. rsample contains `tidy` methods for `rset` and `rsplit` objects. For example:

-```{r tidy_rsplit}
+```{r}
+#| label: tidy_rsplit
 first_resample <- bt_resamples$splits[[1]]
 class(first_resample)
 tidy(first_resample)
 ```

 and

-```{r tidy_rset}
+```{r}
+#| label: tidy_rset
 class(bt_resamples)
 tidy(bt_resamples)
 ```

vignettes/rsample.Rmd

+10-4
@@ -8,7 +8,9 @@ output:
     toc: yes
 ---

-```{r ex_setup, include=FALSE}
+```{r}
+#| label: ex_setup
+#| include: false
 knitr::opts_chunk$set(
   message = FALSE,
   digits = 3,
@@ -28,7 +30,9 @@ The main class in the package (`rset`) is for a _set_ or _collection_ of resampl

 Like [modelr](https://cran.r-project.org/package=modelr), the resamples are stored in data-frame-like `tibble` object. As a simple example, here is a small set of bootstraps of the `mtcars` data:

-```{r mtcars_bt, message=FALSE}
+```{r}
+#| label: mtcars_bt
+#| message: false
 library(rsample)
 set.seed(8584)
 bt_resamples <- bootstraps(mtcars, times = 3)
@@ -48,14 +52,16 @@ In this package we use the following terminology for the two partitions that com
 (Aside: While some might use the term "training" and "testing" for these data sets, we avoid them since those labels often conflict with the data that result from an initial partition of the data that is typically done _before_ resampling. The training/test split can be conducted using the `initial_split()` function in this package.)

 Let's look at one of the `rsplit` objects
-```{r rsplit}
+```{r}
+#| label: rsplit
 first_resample <- bt_resamples$splits[[1]]
 first_resample
 ```
 This indicates that there were `r dim(bt_resamples$splits[[1]])["analysis"]` data points in the analysis set, `r dim(bt_resamples$splits[[1]])["assessment"]` instances were in the assessment set, and that the original data contained `r dim(bt_resamples$splits[[1]])["n"]` data points. These results can also be determined using the `dim` function on an `rsplit` object.

 To obtain either of these data sets from an `rsplit`, the `as.data.frame()` function can be used. By default, the analysis set is returned but the `data` option can be used to return the assessment data:
-```{r rsplit_df}
+```{r}
+#| label: rsplit_df
 head(as.data.frame(first_resample))
 as.data.frame(first_resample, data = "assessment")
 ```
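
Note on the pattern in this diff: every hunk applies the same mechanical rewrite, moving the chunk label and options out of the `{r ...}` header and into `#|` comment lines at the top of the chunk body. The Python sketch below illustrates that rewrite for simple headers. It is not the tool used for this commit; the names `convert_header` and `split_options` are hypothetical, and it assumes options are plain `key = value` pairs. Unlike the commit, it keeps a long `fig.alt` string on one quoted line instead of folding it with `>`.

```python
import re

# Matches a knitr chunk header line such as: ```{r setup, include = FALSE}
# The \b keeps other engines (for example {rust}) from matching.
HEADER = re.compile(r"^```\{r\b(?P<body>[^}]*)\}\s*$")


def split_options(body):
    """Split header text on commas that are not inside double quotes."""
    parts, buf, in_quotes = [], [], False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            parts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf).strip())
    return [p for p in parts if p]


def convert_header(line):
    """Return the replacement lines for one line of an .Rmd file.

    Lines that are not chunk headers come back unchanged.
    Assumption: a bare entry (no "=") is the chunk label.
    """
    match = HEADER.match(line)
    if match is None:
        return [line]
    out = ["```{r}"]
    for part in split_options(match.group("body")):
        if "=" not in part:
            out.append(f"#| label: {part}")
            continue
        key, value = (s.strip() for s in part.split("=", 1))
        # R logicals become YAML booleans, as in the diff above.
        value = {"TRUE": "true", "FALSE": "false"}.get(value, value)
        out.append(f"#| {key}: {value}")
    return out


if __name__ == "__main__":
    for new_line in convert_header("```{r ex_setup, include=FALSE}"):
        print(new_line)
```

Applied line by line to a vignette, this leaves chunk bodies and prose untouched and only expands the header lines, which matches the shape of the hunks above. Headers whose options are R expressions would need manual review.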
