Chapter 6 Regression

In this chapter, we will use machine learning to predict continuous values that are associated with documents. For example, let’s consider a filtered sample of opinions from the United States Supreme Court, available in the scotus (Hvitfeldt 2019b) package.

library(tidyverse)
library(scotus)

scotus_filtered %>%
  as_tibble()
## # A tibble: 10,000 x 5
##    year  case_name                  docket_number     id text                   
##    <chr> <chr>                      <chr>          <dbl> <chr>                  
##  1 1903  Clara Perry, Plff. In Err… 16             80304 "No. 16.\n State Repor…
##  2 1987  West v. Conrail            85-1804        96216 "No. 85-1804.\n\n     …
##  3 1957  Roth v. United States      582            89930 "Nos. 582, 61.\nNo. 61…
##  4 1913  McDermott v. Wisconsin     Nos. 112 and … 82218 "Nos. 112 and 113.\nMr…
##  5 1826  Wetzell v. Bussard         <NA>           52899 "Feb. 7th.\nThis cause…
##  6 1900  Forsyth v. Vehmeyer        180            79609 "No. 180.\nMr. Edward …
##  7 1871  Reed v. United States      <NA>           57846 "APPEAL and cross appe…
##  8 1833  United States v. Mills     <NA>           53394 "CERTIFICATE of Divisi…
##  9 1940  Puerto Rico v. Rubert Her… 582            87714 "No. 582.\nMr. Wm. Cat…
## 10 1910  Williams v. First Nat. Ba… 130            81588 "No. 130.\nThe defenda…
## # … with 9,990 more rows

This dataset contains the entire text of each opinion in the text column, along with the case_name and docket_number. Notice that we also have the year that each case was decided by the Supreme Court; this is basically a continuous variable (rather than a group membership or discrete label).

If we want to build a model to predict which court opinions were written in which years, we would build a regression model.

  • A classification model predicts a class label or group membership.
  • A regression model predicts a numeric or continuous value.

In text modeling, we use text data (such as the text of the court opinions), sometimes combined with other structured, non-text data, to predict the continuous value of interest (such as year of the court opinion).

6.1 A first regression model

Let’s build our first regression model using this sample of Supreme Court opinions. Before we start, let’s check out how many opinions we have for each decade in Figure 6.1.

scotus_filtered %>%
  mutate(
    year = as.numeric(year),
    year = 10 * (year %/% 10)
  ) %>%
  count(year) %>%
  ggplot(aes(year, n)) +
  geom_col() +
  labs(x = "Year", y = "Number of opinions per decade")

FIGURE 6.1: Supreme Court opinions per decade in sample

This sample of opinions reflects the distribution over time of available opinions for analysis; there are many more opinions per year in this dataset after about 1850 than before. This is an example of bias already in our data, as we discussed in the foreword to these chapters, and we will need to account for that in choosing a model and understanding our results.

6.1.1 Building our first regression model

Our first step in building a model is to split our data into training and testing sets. We use functions from tidymodels for this; we use initial_split() to set up how to split the data, and then we use the functions training() and testing() to create the datasets we need. Let's also convert the year to a numeric value since it was originally stored as a character, and remove the ' character because of its effect on one of the models we want to try out.

library(tidymodels)
set.seed(1234)
scotus_split <- scotus_filtered %>%
  mutate(
    year = as.numeric(year),
    text = str_remove_all(text, "'")
  ) %>%
  initial_split()

scotus_train <- training(scotus_split)
scotus_test <- testing(scotus_split)

Next, let’s preprocess our data to get it ready for modeling using a recipe. We’ll use both general preprocessing functions from tidymodels and specialized functions just for text from textrecipes in this preprocessing. What are the steps in creating this recipe?

  • First, we must specify in our initial recipe() statement the form of our model (with the formula year ~ text, meaning we will predict the year of each opinion from the text of that opinion) and what our training data is.
  • Then, we tokenize (Chapter 2) the text of the court opinions.
  • Next, we filter to only keep the top 1000 tokens by term frequency. We filter out those less frequent words because we expect them to be too rare to be reliable, at least for our first attempt. (We are not removing stop words yet; we’ll explore removing them in Section 6.6.)
  • The recipe step step_tfidf(), used with defaults here, weights each token frequency by the inverse document frequency.
  • As a last step, we normalize (center and scale) these tf-idf values. We need to do this centering and scaling because it’s important for lasso regularization.

library(textrecipes)

scotus_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())

scotus_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Tokenization for text
## Text filtering for text
## Term frequency-inverse document frequency with text
## Centering and scaling for all_predictors()

Let’s create a workflow() to bundle together this recipe with any model specifications we may want to create later. A model workflow is a convenient way to combine different modeling components (a preprocessor plus a model specification); when these are bundled explicitly, it can be easier to keep track of your modeling plan, as well as fit your model and predict on new data.

First, let’s create an empty workflow() and then only add the data preprocessor scotus_rec to it.

scotus_wf <- workflow() %>%
  add_recipe(scotus_rec)

scotus_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: None
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## ● step_tokenize()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_normalize()

Notice that there is no model yet: Model: None. It’s time to specify the model we will use! Let’s build a lasso regression model with mixture = 1.

The mixture argument for a linear regression model specifies the proportion of L1 regularization. Using mixture = 1 sets up a lasso model, while mixture = 0 would be a ridge model.
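
For contrast, here is a quick sketch of what a ridge specification would look like; it differs only in the mixture argument, and we won't use it further in this chapter.

# Ridge regression: same engine and mode, but mixture = 0.
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")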

Before fitting, we set up a model specification.

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

lasso_spec
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.1
##   mixture = 1
## 
## Computational engine: glmnet

Everything is now ready for us to fit our model. Let’s add our model to the workflow with add_model() and fit to our training data scotus_train.

lasso_fit <- scotus_wf %>%
  add_model(lasso_spec) %>%
  fit(data = scotus_train)

We have successfully fit a regularized regression model to this dataset of Supreme Court opinions. What does the result look like? We can access the fit using pull_workflow_fit(), and even tidy() the model coefficient results into a convenient dataframe format.

lasso_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)
## # A tibble: 1,001 x 3
##    term                  estimate penalty
##    <chr>                    <dbl>   <dbl>
##  1 (Intercept)            1921.       0.1
##  2 tfidf_text_see            2.22     0.1
##  3 tfidf_text_appeals        1.95     0.1
##  4 tfidf_text_u.s            1.83     0.1
##  5 tfidf_text_ordered        1.48     0.1
##  6 tfidf_text_noted          1.40     0.1
##  7 tfidf_text_later          1.40     0.1
##  8 tfidf_text_petitioner     1.35     0.1
##  9 tfidf_text_example        1.32     0.1
## 10 tfidf_text_basis          1.27     0.1
## # … with 991 more rows

We see here, for the penalty we specified (penalty = 0.1), what terms contribute to a Supreme Court opinion being written more recently, plus the intercept for this particular penalty.

What terms contribute to a Supreme Court opinion being written further in the past, for this first attempt at a model?

lasso_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(estimate)
## # A tibble: 1,001 x 3
##    term            estimate penalty
##    <chr>              <dbl>   <dbl>
##  1 tfidf_text_the     -3.30     0.1
##  2 tfidf_text_it      -2.41     0.1
##  3 tfidf_text_but     -2.21     0.1
##  4 tfidf_text_be      -2.05     0.1
##  5 tfidf_text_1st     -2.01     0.1
##  6 tfidf_text_of      -2.00     0.1
##  7 tfidf_text_this    -1.98     0.1
##  8 tfidf_text_he      -1.91     0.1
##  9 tfidf_text_been    -1.80     0.1
## 10 tfidf_text_rep     -1.78     0.1
## # … with 991 more rows

6.1.2 Evaluation

One option for evaluating our model is to predict one time on the testing set to measure performance.

The testing set is extremely valuable data, however, and in real world situations, we advise that you only use this precious resource one time (or at most, twice).

The purpose of the testing data is to estimate how your final model will perform on new data; we set aside a proportion of the data available and pretend that it is not available to us for training the model so we can use it to estimate model performance on strictly out-of-sample data. Often during the process of modeling, we want to compare models or different model parameters. If we use the test set for these kinds of tasks, we risk fooling ourselves that we are doing better than we really are.

Another option for evaluating models is to predict one time on the training set to measure performance. This is the same data that was used to train the model, however, and evaluating on the training data often results in performance estimates that are too optimistic. This is especially true for powerful machine learning algorithms that can learn subtle patterns from data; we risk overfitting to the training set.

Yet another option for evaluating or comparing models is to use a separate validation set. In this situation, we split our data not into two sets (training and testing) but into three sets (testing, training, and validation). The validation set is used for computing performance metrics to compare models or model parameters. This can be a great option if you have enough data for it, but often we as machine learning practitioners are not so lucky.
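
If we did want to go this route, a validation set could be carved out of the training data with rsample, as in this sketch; we do not use a validation set for the Supreme Court opinions.

# Hold out 20% of the training data as a single validation set
# (illustrative only; we use cross-validation instead below).
set.seed(1234)
scotus_val <- validation_split(scotus_train, prop = 0.8)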

What are we to do, then, if we want to train multiple models and find the best one? Or compute a reliable estimate for how our model has performed without wasting the valuable testing set? We can use resampling. When we resample, we create new simulated datasets from the training set for the purpose of, for example, measuring model performance.

Let's estimate the performance of the lasso regression model we just fit. We can do this using resampled datasets built from the training set. Let's create 10-fold cross-validation sets, and use these resampled sets for performance estimates.

set.seed(123)
scotus_folds <- vfold_cv(scotus_train)

scotus_folds
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [6.8K/750]> Fold01
##  2 <split [6.8K/750]> Fold02
##  3 <split [6.8K/750]> Fold03
##  4 <split [6.8K/750]> Fold04
##  5 <split [6.8K/750]> Fold05
##  6 <split [6.8K/750]> Fold06
##  7 <split [6.8K/750]> Fold07
##  8 <split [6.8K/750]> Fold08
##  9 <split [6.8K/750]> Fold09
## 10 <split [6.8K/750]> Fold10

Each of these “splits” contains information about how to create cross-validation folds from the original training data. In this example, 90% of the training data is included in each fold for analysis and the other 10% is held out for assessment. Since we used cross-validation, each Supreme Court opinion appears in only one of these held-out assessment sets.
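
We can peek inside one of these splits to check those proportions; the rsample functions analysis() and assessment() return the two pieces of data for a given fold. This is just a quick sketch for inspection, not something we need for modeling.

# Look at the first fold: analysis() returns the ~90% used for fitting
# and assessment() returns the ~10% held out for evaluation.
first_fold <- scotus_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))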

In Section 6.1.1, we fit one time to the training data as a whole. Now, to estimate how well that model performs, let's fit many times, once to each of these resampled folds, and then evaluate on the held-out part of each resampled fold.

set.seed(123)
lasso_rs <- fit_resamples(
  scotus_wf %>% add_model(lasso_spec),
  scotus_folds,
  control = control_resamples(save_pred = TRUE)
)

lasso_rs
## # Resampling results
## # 10-fold cross-validation 
## # A tibble: 10 x 5
##    splits             id     .metrics         .notes           .predictions     
##    <list>             <chr>  <list>           <list>           <list>           
##  1 <split [6.8K/750]> Fold01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  2 <split [6.8K/750]> Fold02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  3 <split [6.8K/750]> Fold03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  4 <split [6.8K/750]> Fold04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  5 <split [6.8K/750]> Fold05 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  6 <split [6.8K/750]> Fold06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  7 <split [6.8K/750]> Fold07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  8 <split [6.8K/750]> Fold08 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
##  9 <split [6.8K/750]> Fold09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…
## 10 <split [6.8K/750]> Fold10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 4…

These results look a lot like the resamples, but they have some additional columns, like the .metrics that we can use to measure how well this model performed and the .predictions we can use to explore that performance more deeply. What results do we see, in terms of performance metrics?

lasso_rs %>%
  collect_metrics()
## # A tibble: 2 x 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   15.7      10 0.204   Preprocessor1_Model1
## 2 rsq     standard    0.893    10 0.00192 Preprocessor1_Model1

The default performance metrics to be computed for regression models are RMSE (root mean squared error) and \(R^2\). RMSE is a metric that is in the same units as the original data, so in units of years, in our case; the RMSE of this first regression model is 15.7 years.

The lower RMSE is, the better; the closer \(R^2\) is to one, the better.
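
For reference, RMSE is defined as

\[
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\]

where \(y_i\) is the true year of the \(i\)-th opinion and \(\hat{y}_i\) is the predicted year.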

These values are quantitative estimates for how well our model performed, and can be compared across different kinds of models. Figure 6.2 shows the predicted years for these Supreme Court opinions plotted against the true years when they were published, for all the resampled datasets.

lasso_rs %>%
  collect_predictions() %>%
  ggplot(aes(year, .pred, color = id)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Truth",
    y = "Predicted year",
    color = NULL,
    title = "Predicted and true years for Supreme Court opinions",
    subtitle = "Each cross-validation fold is shown in a different color"
  )

FIGURE 6.2: Most Supreme Court opinions are near the dashed line, indicating good agreement between our lasso regression predictions and the real years

The average spread of points in this plot above and below the dashed line corresponds to RMSE, which is 15.7 years for this model. When RMSE is better (lower), the points will be closer to the dashed line. This first model we have tried did not do a great job for Supreme Court opinions from before 1850, but for opinions after 1850, this looks pretty good!

6.2 Compare to the null model

One way to assess a model like this one is to compare its performance to a “null model.”

A null model is a simple, non-informative model that always predicts the largest class (for classification) or the mean (such as the mean year of Supreme Court opinions, in our specific regression case).

We can use the same function fit_resamples() and the same preprocessing recipe as before, switching out our lasso model specification for the null_model() specification.

null_regression <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression")

null_rs <- fit_resamples(
  scotus_wf %>% add_model(null_regression),
  scotus_folds,
  metrics = metric_set(rmse)
)

null_rs
## # Resampling results
## # 10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     .metrics         .notes          
##    <list>             <chr>  <list>           <list>          
##  1 <split [6.8K/750]> Fold01 <tibble [1 × 4]> <tibble [0 × 1]>
##  2 <split [6.8K/750]> Fold02 <tibble [1 × 4]> <tibble [0 × 1]>
##  3 <split [6.8K/750]> Fold03 <tibble [1 × 4]> <tibble [0 × 1]>
##  4 <split [6.8K/750]> Fold04 <tibble [1 × 4]> <tibble [0 × 1]>
##  5 <split [6.8K/750]> Fold05 <tibble [1 × 4]> <tibble [0 × 1]>
##  6 <split [6.8K/750]> Fold06 <tibble [1 × 4]> <tibble [0 × 1]>
##  7 <split [6.8K/750]> Fold07 <tibble [1 × 4]> <tibble [0 × 1]>
##  8 <split [6.8K/750]> Fold08 <tibble [1 × 4]> <tibble [0 × 1]>
##  9 <split [6.8K/750]> Fold09 <tibble [1 × 4]> <tibble [0 × 1]>
## 10 <split [6.8K/750]> Fold10 <tibble [1 × 4]> <tibble [0 × 1]>

What results do we obtain from the null model, in terms of performance metrics?

null_rs %>%
  collect_metrics()
## # A tibble: 1 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard    48.0    10   0.512 Preprocessor1_Model1

The RMSE indicates that this null model is dramatically worse than our first model. Even our very first attempt at a regression model (using only unigrams and no tuning) did much better than the null model; the text of the Supreme Court opinions has enough information in it related to the year the opinions were published that we can build successful models.

6.3 Tuning lasso hyperparameters

The value penalty = 0.1 for regularization in Section 6.1.1 was picked somewhat arbitrarily. How do we know the right or best value for the regularization penalty? This is a model hyperparameter: we cannot learn its best value during model training, but we can estimate it by training many models on resampled datasets and exploring how well all these models perform. Let's build a new model specification for model tuning.

tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

tune_spec
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

After the tuning process, we can select a single best numeric value.

Think of tune() here as a placeholder for the regularization penalty.

We can create a regular grid of values to try, using a convenience function for penalty().

lambda_grid <- grid_regular(penalty(), levels = 30)

lambda_grid
## # A tibble: 30 x 1
##     penalty
##       <dbl>
##  1 1.00e-10
##  2 2.21e-10
##  3 4.89e-10
##  4 1.08e- 9
##  5 2.40e- 9
##  6 5.30e- 9
##  7 1.17e- 8
##  8 2.59e- 8
##  9 5.74e- 8
## 10 1.27e- 7
## # … with 20 more rows

The function grid_regular() is from the dials package. It chooses sensible values to try for a parameter like the regularization penalty; here, we asked for 30 different possible values.

Now it is time to tune! Let’s use tune_grid() to fit a model at each of the values for the regularization penalty in our regular grid.

Tuning a model uses much the same syntax as fitting a model to a set of resampled datasets for the purpose of evaluation (fit_resamples()), because the two tasks are so similar. The difference is that when you tune, each model that you fit has different parameters, and you want to find the best one.

We add our tunable model specification tune_spec to the same workflow we’ve been using so far that contains the preprocessing recipe.

set.seed(2020)
tune_rs <- tune_grid(
  scotus_wf %>% add_model(tune_spec),
  scotus_folds,
  grid = lambda_grid,
  control = control_resamples(save_pred = TRUE)
)

tune_rs

Now, instead of one set of metrics, we have a set of metrics for each value of the regularization penalty.

tune_rs %>%
  collect_metrics()
## # A tibble: 60 x 7
##     penalty .metric .estimator   mean     n std_err .config              
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
##  1 1.00e-10 rmse    standard   16.1      10 0.195   Preprocessor1_Model01
##  2 1.00e-10 rsq     standard    0.889    10 0.00181 Preprocessor1_Model01
##  3 2.21e-10 rmse    standard   16.1      10 0.195   Preprocessor1_Model02
##  4 2.21e-10 rsq     standard    0.889    10 0.00181 Preprocessor1_Model02
##  5 4.89e-10 rmse    standard   16.1      10 0.195   Preprocessor1_Model03
##  6 4.89e-10 rsq     standard    0.889    10 0.00181 Preprocessor1_Model03
##  7 1.08e- 9 rmse    standard   16.1      10 0.195   Preprocessor1_Model04
##  8 1.08e- 9 rsq     standard    0.889    10 0.00181 Preprocessor1_Model04
##  9 2.40e- 9 rmse    standard   16.1      10 0.195   Preprocessor1_Model05
## 10 2.40e- 9 rsq     standard    0.889    10 0.00181 Preprocessor1_Model05
## # … with 50 more rows

Let’s visualize these metrics, RMSE and \(R^2\), in Figure 6.3 to see what the best model is.

tune_rs %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none") +
  labs(
    title = "Lasso model performance across regularization penalties",
    subtitle = "Performance metrics can be used to identity the best penalty"
  )

FIGURE 6.3: We can identify the best regularization penalty from model performance metrics, for example, at the lowest RMSE

We can view the best results with show_best() and a choice for the metric, such as RMSE.

tune_rs %>%
  show_best("rmse")
## # A tibble: 5 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.0924  rmse    standard    15.8    10   0.203 Preprocessor1_Model27
## 2 0.204   rmse    standard    15.8    10   0.214 Preprocessor1_Model28
## 3 0.0418  rmse    standard    15.9    10   0.195 Preprocessor1_Model26
## 4 0.0189  rmse    standard    16.0    10   0.194 Preprocessor1_Model25
## 5 0.00853 rmse    standard    16.1    10   0.195 Preprocessor1_Model24

The best value for RMSE from this tuning run is 15.8. We can extract the best regularization parameter for this value of RMSE from our tuning results with select_best().

lowest_rmse <- tune_rs %>%
  select_best("rmse")

lowest_rmse
## # A tibble: 1 x 2
##   penalty .config              
##     <dbl> <chr>                
## 1  0.0924 Preprocessor1_Model27

Next, let's finalize our tunable workflow with this particular regularization penalty. This is the regularization penalty that our tuning results indicate gives us the best model.

final_lasso <- finalize_workflow(
  scotus_wf %>% add_model(tune_spec),
  lowest_rmse
)

final_lasso
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## ● step_tokenize()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.0923670857187388
##   mixture = 1
## 
## Computational engine: glmnet

Instead of penalty = tune() like before, now our workflow has finalized values for all arguments; we tuned the regularization penalty and ended up with a penalty value of 0.092. This workflow is ready to go! Once fit to the training data, it can be applied to new data.
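
As a minimal sketch of what that looks like (purely illustrative, predicting on a few training rows so the testing set stays untouched), we can fit the finalized workflow and then call predict() on new data.

# Fit the finalized workflow to the training data, then predict;
# the rows used here are training rows, only to show the interface.
fitted_lasso <- fit(final_lasso, data = scotus_train)
predict(fitted_lasso, new_data = scotus_train %>% slice(1:3))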

6.4 Compare to a random forest model

Random forest models are broadly used in predictive modeling contexts because they are low-maintenance and perform well. For example, see Caruana, Karampatziakis, and Yessenalina (2008) and Olson et al. (2017) for comparisons of the performance of common models such as random forest, decision tree, support vector machines, etc. trained on benchmark datasets; random forest models were one of the best overall. Let’s see how a random forest model performs with our dataset of Supreme Court opinions.

First, let’s build a random forest model specification, using the ranger implementation. Random forest models are known for performing well without tuning, so we will just make sure we have enough trees.

rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_spec
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   trees = 1000
## 
## Computational engine: ranger

Now we can fit this random forest model. Let’s use fit_resamples() again, so we can evaluate the model performance. We will use three arguments to this function:

  • Our modeling workflow(), with the same preprocessing recipe we have been using so far in this chapter plus our new random forest model specification
  • Our cross-validation resamples of the Supreme Court opinions
  • A control argument to specify that we want to keep the predictions, to explore after fitting

rf_rs <- fit_resamples(
  scotus_wf %>% add_model(rf_spec),
  scotus_folds,
  control = control_resamples(save_pred = TRUE)
)

We can use collect_metrics() to obtain and format the performance metrics for this random forest model.

collect_metrics(rf_rs)
## # A tibble: 2 x 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   15.0      10 0.483   Preprocessor1_Model1
## 2 rsq     standard    0.919    10 0.00428 Preprocessor1_Model1

This looks pretty promising, so let’s explore the predictions for this random forest model.

rf_rs %>%
  collect_predictions() %>%
  ggplot(aes(year, .pred, color = id)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Truth",
    y = "Predicted year",
    color = NULL,
    title = "Predicted and true years for Supreme Court opinions using a random forest model",
    subtitle = "Each cross-validation fold is shown in a different color"
  )

FIGURE 6.4: The random forest model did not perform very sensibly across years, compared to our first attempt using a linear model with lasso regularization

Figure 6.4 shows some of the strange behavior from our fitted model. The overall performance metrics look pretty good, but predictions are too high and too low around certain threshold years.

It is very common to run into problems when using tree-based models like random forests with text data. One of the defining characteristics of text data is that it is sparse, with many features but most features not occurring in most observations. Tree-based models such as random forests are often not well-suited to sparse data because of how decision trees model outcomes (Tang, Garreau, and Luxburg 2018).

Models that work best with text tend to be models designed for or otherwise appropriate for sparse data.

Algorithms that work well with sparse data are less important when text has been transformed to a non-sparse representation such as with word embeddings (Chapter 5).
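
To get a rough sense of how sparse our own features are, here is a quick diagnostic sketch: it preps a tf-idf recipe without centering and scaling (so the zeros are preserved) and computes the proportion of entries that are exactly zero. It is not part of the modeling workflow.

# Bake the tf-idf features and measure the proportion of zero entries.
sparsity_check <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-year)

mean(as.matrix(sparsity_check) == 0)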

6.5 Case study: sparse encoding

Speaking of sparse data, we can change how our text data is represented to take advantage of this sparsity. The regularized regression model we have been training in previous sections used set_engine("glmnet"); this computational engine can be more efficient when text data is transformed to a sparse matrix, rather than a dense data frame or tibble representation.

To keep our text data sparse in the sense of having many zero values, we need to create a slightly different preprocessing recipe.

tfidf_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text)

tfidf_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Tokenization for text
## Text filtering for text
## Term frequency-inverse document frequency with text

How is this recipe different from scotus_rec? It does not include a step to center and scale the predictors with step_normalize(), instead relying on the “normalization” of weighting by tf-idf. Regularized models do require that all predictors are on the same scale, but weighting by tf-idf can often provide this for text data, without a final centering and scaling.

The next step to use the sparse capabilities of set_engine("glmnet") is to explicitly set a non-default preprocessing blueprint, using the package hardhat.

The hardhat package is used by other tidymodels packages like recipes and parsnip under the hood. As a tidymodels user, you typically don’t use hardhat functions directly. The exception is when you need to customize something about your model or preprocessing, like in this sparse data example.

library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

This “blueprint” lets us specify during modeling how we want our data passed around from the preprocessing into the model. The composition "dgCMatrix" is the most common sparse matrix type, from the Matrix package (Bates and Maechler 2019), used in R for modeling. We can use this blueprint argument when we add our recipe to our modeling workflow, to define how the data should be passed into the model.

sparse_wf <- workflow() %>%
  add_recipe(tfidf_rec, blueprint = sparse_bp) %>%
  add_model(tune_spec)

sparse_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_tokenize()
## ● step_tokenfilter()
## ● step_tfidf()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

The last time we tuned a lasso model, we used the defaults for the penalty parameter and 30 levels. Let's restrict the values this time using the range argument, so we don't test values for regularization that are quite as small, and let's only try 20 levels. (The range for penalty() is specified on the log10 scale, so range = c(-5, 0) corresponds to penalties between 1e-5 and 1.)

grid_twenty <- grid_regular(penalty(range = c(-5, 0)), levels = 20)

grid_twenty
## # A tibble: 20 x 1
##      penalty
##        <dbl>
##  1 0.00001  
##  2 0.0000183
##  3 0.0000336
##  4 0.0000616
##  5 0.000113 
##  6 0.000207 
##  7 0.000379 
##  8 0.000695 
##  9 0.00127  
## 10 0.00234  
## 11 0.00428  
## 12 0.00785  
## 13 0.0144   
## 14 0.0264   
## 15 0.0483   
## 16 0.0886   
## 17 0.162    
## 18 0.298    
## 19 0.546    
## 20 1

We can tune this lasso regression model, in the same way that we did in Section 6.3. We will fit and assess each possible regularization parameter on each resampling fold, to find the best amount of regularization.

set.seed(2020)
sparse_rs <- tune_grid(
  sparse_wf,
  scotus_folds,
  grid = grid_twenty
)

sparse_rs

How did this model turn out, especially compared to the tuned model that included centering/scaling and did not use the sparse capabilities of set_engine("glmnet")?

sparse_rs %>%
  show_best("rmse")
## # A tibble: 5 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1  0.0886 rmse    standard    15.8    10   0.202 Preprocessor1_Model16
## 2  0.162  rmse    standard    15.8    10   0.211 Preprocessor1_Model17
## 3  0.0483 rmse    standard    15.8    10   0.195 Preprocessor1_Model15
## 4  0.0264 rmse    standard    15.9    10   0.194 Preprocessor1_Model14
## 5  0.0144 rmse    standard    16.0    10   0.194 Preprocessor1_Model13

The best RMSE is nearly identical; the best RMSE for the non-sparse tuned lasso model in Section 6.3 was 15.8. The best regularization parameter (penalty) is a little different (the best value in Section 6.3 was 0.092) but we used a different grid so didn’t try out exactly the same values. We ended up with nearly the same performance and best tuned model.

Importantly, this tuning also took a bit less time to complete.

  • The preprocessing was only a little bit faster: most of the preprocessing time is spent on tokenization and computing tf-idf, not on the step_normalize() step we removed.
  • The model fitting was much faster: for highly sparse data, this implementation of regularized regression handles sparse matrix input much more efficiently than dense input.

Overall, the whole tuning workflow is about 10% faster using the sparse preprocessing blueprint. Depending on how computationally expensive your preprocessing is relative to your model and how sparse your data is, you may expect to see larger (or smaller) gains from moving to a sparse data representation.

Since our model performance is about the same and we see gains in training time, let’s use this sparse representation for the rest of this chapter.

6.6 Case study: removing stop words

We did not remove stop words (Chapter 3) in any of our models so far in this chapter. What impact will removing stop words have, and how do we know which stop word list is the best to use? The best way to answer these questions is with experimentation.

Removing stop words is part of data preprocessing, so we define this step as part of our preprocessing recipe. Let’s use the best model we’ve found so far (the regularized regression model from Section 6.3) and switch in a different recipe in our modeling workflow.

Let’s build a small recipe wrapper helper function so we can pass a value stopword_name to step_stopwords().

stopword_rec <- function(stopword_name) {
  recipe(year ~ text, data = scotus_train) %>%
    step_tokenize(text) %>%
    step_stopwords(text, stopword_source = stopword_name) %>%
    step_tokenfilter(text, max_tokens = 1e3) %>%
    step_tfidf(text)
}

For example, now we can create a recipe that removes the Snowball stop words list by calling this function.

stopword_rec("snowball")
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Tokenization for text
## Stop word removal for text
## Text filtering for text
## Term frequency-inverse document frequency with text

Next, let’s set up a new workflow that has a model only, using add_model(). We start with the empty workflow() and then add our tunable regularized regression model.

tunable_wf <- workflow() %>%
  add_model(tune_spec)

tunable_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: None
## Model: linear_reg()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

Notice that for this workflow, there is no preprocessor yet: Preprocessor: None. This workflow uses the same tunable lasso model specification that we used in Section 6.3 but we are going to combine several different preprocessing recipes with it, one for each stop word lexicon we want to try.

Now we can put this all together and tune these models which include stop word removal. We could create a little helper function for tuning like we did for the recipe, but we have printed out all three calls to tune_grid() for extra clarity. Notice for each one that there are three arguments:

  • A tunable workflow, which consists of the tunable lasso model specification and a data preprocessing recipe with stop word removal
  • The same cross-validation folds we created earlier
  • The same grid of values for the regularization parameter that we used in the previous case study

set.seed(123)
snowball_rs <- tune_grid(
  tunable_wf %>% add_recipe(stopword_rec("snowball"), blueprint = sparse_bp),
  scotus_folds,
  grid = grid_twenty
)

set.seed(234)
smart_rs <- tune_grid(
  tunable_wf %>% add_recipe(stopword_rec("smart"), blueprint = sparse_bp),
  scotus_folds,
  grid = grid_twenty
)

set.seed(345)
stopwords_iso_rs <- tune_grid(
  tunable_wf %>% add_recipe(stopword_rec("stopwords-iso"), blueprint = sparse_bp),
  scotus_folds,
  grid = grid_twenty
)

After fitting models for each of the possible regularization values to each of the cross-validation folds, these sets of results contain metrics computed for removing that set of stop words.

collect_metrics(smart_rs)
## # A tibble: 40 x 7
##      penalty .metric .estimator   mean     n std_err .config              
##        <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   16.5      10 0.187   Preprocessor1_Model01
##  2 0.00001   rsq     standard    0.883    10 0.00215 Preprocessor1_Model01
##  3 0.0000183 rmse    standard   16.5      10 0.187   Preprocessor1_Model02
##  4 0.0000183 rsq     standard    0.883    10 0.00215 Preprocessor1_Model02
##  5 0.0000336 rmse    standard   16.5      10 0.187   Preprocessor1_Model03
##  6 0.0000336 rsq     standard    0.883    10 0.00215 Preprocessor1_Model03
##  7 0.0000616 rmse    standard   16.5      10 0.187   Preprocessor1_Model04
##  8 0.0000616 rsq     standard    0.883    10 0.00215 Preprocessor1_Model04
##  9 0.000113  rmse    standard   16.5      10 0.187   Preprocessor1_Model05
## 10 0.000113  rsq     standard    0.883    10 0.00215 Preprocessor1_Model05
## # … with 30 more rows

We can explore whether one of these sets of stop words performed better than the others by comparing the performance, for example in terms of RMSE as shown in Figure 6.5. This plot shows the five best models for each set of stop words, using show_best() applied to each via purrr::map_dfr().

word_counts <- tibble(name = c("snowball", "smart", "stopwords-iso")) %>%
  mutate(words = map_int(name, ~ length(stopwords::stopwords(source = .))))

list(
  snowball = snowball_rs,
  smart = smart_rs,
  `stopwords-iso` = stopwords_iso_rs
) %>%
  map_dfr(show_best, "rmse", .id = "name") %>%
  left_join(word_counts) %>%
  mutate(name = paste0(name, " (", words, " words)")) %>%
  ggplot(aes(fct_reorder(name, words), mean, color = name)) +
  geom_point(size = 3, alpha = 0.8, show.legend = FALSE) +
  labs(
    x = NULL, y = "mean RMSE for five best models",
    title = "Model performance for three stop word lexicons",
    subtitle = "For this dataset, the Snowball lexicon performed best"
  )

FIGURE 6.5: Comparing model performance for predicting the year of Supreme Court opinions with three different stop word lexicons

The Snowball lexicon contains the smallest number of words (see Figure 3.1) and, in this case, results in the best performance; removing fewer stop words works best for this dataset.

This result is not generalizable to all data sets and contexts, but the approach outlined in this section is generalizable.

This approach can be used to compare different lexicons and find the best one for a specific dataset and model. Notice how the results for smart and stopwords-iso are worse than removing no stop words at all (see Figure 6.3). This indicates that removing only a small list of stop words is a good choice here.

This increase in performance isn’t huge, but removing stop words isn’t computationally slow or difficult so the cost for this improvement is low.

6.7 Case study: varying n-grams

Each model trained so far in this chapter has involved single words or unigrams, but using n-grams (Section 2.2.3) can integrate different kinds of information into a model. Bigrams and trigrams (or even higher order n-grams) capture concepts that span single words, as well as effects from word order, that can be predictive.
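
As a small illustration of what this kind of tokenization produces, here is a sketch that calls the tokenizers package directly on a short made-up phrase, extracting unigrams plus bigrams.

# Unigrams and bigrams for a single short phrase.
tokenizers::tokenize_ngrams("the court of appeals disagreed", n = 2, n_min = 1)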

This is another part of data preprocessing, so we again define this step as part of our preprocessing recipe. Let’s build another small recipe wrapper helper function so we can pass a list of options ngram_options to step_tokenize(). We’ll use it with the same model as the previous section.

ngram_rec <- function(ngram_options) {
  recipe(year ~ text, data = scotus_train) %>%
    step_tokenize(text, token = "ngrams", options = ngram_options) %>%
    step_tokenfilter(text, max_tokens = 1e3) %>%
    step_tfidf(text)
}

There are two options we can specify, n and n_min, when we are using engine = "tokenizers". We can set up a recipe with only n = 1 to tokenize and only extract the unigrams.

ngram_rec(list(n = 1))

We can use n = 3, n_min = 1 to identify the set of all trigrams, bigrams, and unigrams.

ngram_rec(list(n = 3, n_min = 1))

Including n-grams of different orders in a model (such as trigrams, bigrams, plus unigrams) allows the model to learn at different levels of linguistic organization and context.

We can reuse the same components tunable_wf and grid_twenty from our earlier case study; these modular components are a benefit to adopting this approach to supervised machine learning. These model components provide the tunable lasso model specification and possible regularization parameters to try. Let’s put it all together and create a helper function to use tune_grid() with these components plus our helper recipe function.

tune_ngram <- function(ngram_options) {
  tune_grid(
    tunable_wf %>% add_recipe(ngram_rec(ngram_options), blueprint = sparse_bp),
    scotus_folds,
    grid = grid_twenty
  )
}

We could have created this type of small function for trying out different stop word lexicons in Section 6.6, but there we showed each call to tune_grid() for extra clarity.

With this helper function, let’s try out predicting the year of Supreme Court opinions using:

  • only unigrams
  • bigrams and unigrams
  • trigrams, bigrams, and unigrams

set.seed(123)
unigram_rs <- tune_ngram(list(n = 1))

set.seed(234)
bigram_rs <- tune_ngram(list(n = 2, n_min = 1))

set.seed(345)
trigram_rs <- tune_ngram(list(n = 3, n_min = 1))

These sets of results contain metrics computed for the model with that tokenization strategy.

collect_metrics(bigram_rs)
## # A tibble: 40 x 7
##      penalty .metric .estimator   mean     n std_err .config              
##        <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   15.8      10 0.200   Preprocessor1_Model01
##  2 0.00001   rsq     standard    0.892    10 0.00219 Preprocessor1_Model01
##  3 0.0000183 rmse    standard   15.8      10 0.200   Preprocessor1_Model02
##  4 0.0000183 rsq     standard    0.892    10 0.00219 Preprocessor1_Model02
##  5 0.0000336 rmse    standard   15.8      10 0.200   Preprocessor1_Model03
##  6 0.0000336 rsq     standard    0.892    10 0.00219 Preprocessor1_Model03
##  7 0.0000616 rmse    standard   15.8      10 0.200   Preprocessor1_Model04
##  8 0.0000616 rsq     standard    0.892    10 0.00219 Preprocessor1_Model04
##  9 0.000113  rmse    standard   15.8      10 0.200   Preprocessor1_Model05
## 10 0.000113  rsq     standard    0.892    10 0.00219 Preprocessor1_Model05
## # … with 30 more rows

We can compare the performance of these models in terms of RMSE as shown in Figure 6.6. Instead of looking at the top 5 best-performing models with show_best() as in Figure 6.5, let's look at all the models we trained and make a dot plot.

list(
  `1` = unigram_rs,
  `1 and 2` = bigram_rs,
  `1, 2, and 3` = trigram_rs
) %>%
  map_dfr(collect_metrics, .id = "name") %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(name, mean, fill = name)) +
  geom_dotplot(
    binaxis = "y", stackdir = "center", binpositions = "all",
    show.legend = FALSE
  ) +
  labs(
    x = "Degree of n-grams", y = "mean RMSE",
    title = "Model performance for different degrees of n-gram tokenization",
    subtitle = "For the same number of tokens, bigrams plus unigrams performed best"
  )

FIGURE 6.6: Comparing model performance for predicting the year of Supreme Court opinions with three different degrees of n-grams

Each of these models was trained with max_tokens = 1e3, i.e., including only the top 1000 tokens for each tokenization strategy. Holding the number of tokens constant, using bigrams plus unigrams performs best for this corpus of Supreme Court opinions. The performance gain in moving from unigrams to unigrams plus bigrams is significant, but adding in trigrams doesn’t change the situation as much.

Keep in mind that adding n-grams is computationally expensive, especially compared to the improvement in model performance gained. We can benchmark the whole model workflow, including preprocessing and modeling. Using bigrams plus unigrams takes more than twice as long to train than only unigrams, and adding in trigrams as well takes almost five times as long as training on unigrams alone.
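
One way such a timing comparison could be set up is with the bench package, as in this sketch; each call re-runs the full tuning, so this is slow, and the exact ratios will depend on your hardware.

# Time the whole tuning workflow for each tokenization strategy.
# check = FALSE because the tuning results are not expected to be identical.
library(bench)

ngram_timing <- bench::mark(
  unigrams = tune_ngram(list(n = 1)),
  bigrams = tune_ngram(list(n = 2, n_min = 1)),
  trigrams = tune_ngram(list(n = 3, n_min = 1)),
  iterations = 1,
  check = FALSE
)

ngram_timing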

6.8 Case study: lemmatization

As we discussed in Section 4.6, we can normalize words to their roots or lemmas based on each word’s context and the structure of a language. Table 6.1 shows both the original words and the lemmas for one sentence from a Supreme Court opinion, using lemmatization implemented via the spaCy library as made available through the spacyr R package (Benoit and Matsuo 2019).

TABLE 6.1: Lemmatization of one sentence from a Supreme Court opinion
original word    lemma
--------------   ------------
However          however
,                ,
the              the
Court            Court
of               of
Appeals          Appeals
disagreed        disagree
with             with
the              the
District         District
Court            Court
’s               ’s
construction     construction
of               of
the              the
state            state
statute          statute
,                ,
concluding       conclude
that             that
it               -PRON-
did              do
authorize        authorize
issuance         issuance
of               of
the              the
orders           order
to               to
withhold         withhold
to               to
the              the
Postal           Postal
Service          Service
.                .

Notice several things about lemmatization that are different from the kind of default tokenization (Chapter 2) you may be more familiar with.

  • Words are converted to lower case except for proper nouns.
  • The lemma for pronouns is -PRON-.
  • Irregular verbs are converted to their canonical form (“did” to “do”).
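
To see where lemmas like these come from, here is a small sketch that calls spacyr directly on one sentence; it assumes spaCy is installed and the spacyr backend has been initialized.

# Parse one sentence and return token-level lemmas (among other columns).
spacyr::spacy_parse(
  "The Court of Appeals disagreed with the District Court.",
  lemma = TRUE
)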

Using lemmatization instead of a more straightforward tokenization strategy is slower because of the increased complexity involved, but it can be worth it. Let’s explore how to train a model using lemmas instead of words.

Lemmatization is, like choices around n-grams and stop words, part of data preprocessing so we define how to set up lemmatization as part of our preprocessing recipe. We use engine = "spacyr" for tokenization (instead of the default) and add step_lemma() to our preprocessing. This step extracts the lemmas from the parsing done by the tokenization engine.

spacyr::spacy_initialize(entity = FALSE)
## NULL
lemma_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_lemma(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text)

lemma_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Tokenization for text
## Lemmatization for text
## Text filtering for text
## Term frequency-inverse document frequency with text

Let’s combine this lemmatized text with our tunable workflow and grid of possible parameter values. We can then tune our workflow and identify the best model for the lemmatized text.

set.seed(123)
lemma_rs <- tune_grid(
  tunable_wf %>% add_recipe(lemma_rec, blueprint = sparse_bp),
  scotus_folds,
  grid = grid_twenty
)

Let’s visualize the performance metrics, RMSE and \(R^2\), in Figure 6.7 to see what the best model using lemmas is, much like we did in Figure 6.3.

FIGURE 6.7: The best model using lemmatization is better than the best model without.

What is the best model, using lemmatization?

lemma_rs %>%
  show_best("rmse")
## # A tibble: 5 x 7
##   penalty .metric .estimator  mean     n std_err .config
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
## 1  0.0886 rmse    standard    14.6    10   0.411 Model16
## 2  0.162  rmse    standard    14.6    10   0.384 Model17
## 3  0.0483 rmse    standard    14.7    10   0.429 Model15
## 4  0.0264 rmse    standard    14.7    10   0.438 Model14
## 5  0.0144 rmse    standard    14.8    10   0.436 Model13

The best value for RMSE at 14.59 shows us that using lemmatization can have a significant benefit for model performance, compared to 15.8 from tuning a non-lemmatized lasso model in Section 6.3. However, this comes at a cost of much slower training because of the procedure involved in identifying lemmas; adding step_lemma() to our preprocessing increases the overall time to train the workflow by over tenfold.

6.9 What evaluation metrics are appropriate?

We have focused on using RMSE and \(R^2\) as metrics for our models in this chapter, the defaults in the tidymodels framework. Other metrics can also be appropriate for regression models. Another common set of regression metric options is the various flavors of mean absolute error.
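
Two common flavors are the mean absolute error (MAE) and the mean absolute percent error (MAPE), defined as

\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
\]

where \(y_i\) is the true value and \(\hat{y}_i\) is the prediction; MAPE is expressed as a percentage.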

If you know before you fit your model that you want to compute one or more of these metrics, you can specify them in a call to metric_set(). Let's set up a tuning run that computes mean absolute error (mae) and mean absolute percent error (mape).

lemma_rs <- tune_grid(
  tunable_wf %>% add_recipe(lemma_rec),
  scotus_folds,
  grid = grid_twenty,
  metrics = metric_set(mae, mape)
)

If you have already fit your model, you can still compute and explore non-default metrics as long as you saved the predictions for your resampled datasets using control_resamples(save_pred = TRUE).

Let’s go back to the lasso model we tuned in Section 6.3, with results in tune_rs. We can compute the overall mean absolute percent error.

tune_rs %>%
  collect_predictions() %>%
  mape(year, .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard       0.631

We can also compute the mean absolute percent error for each resample.

tune_rs %>%
  collect_predictions() %>%
  group_by(id) %>%
  mape(year, .pred)
## # A tibble: 10 x 4
##    id     .metric .estimator .estimate
##    <chr>  <chr>   <chr>          <dbl>
##  1 Fold01 mape    standard       0.638
##  2 Fold02 mape    standard       0.628
##  3 Fold03 mape    standard       0.657
##  4 Fold04 mape    standard       0.615
##  5 Fold05 mape    standard       0.608
##  6 Fold06 mape    standard       0.638
##  7 Fold07 mape    standard       0.636
##  8 Fold08 mape    standard       0.609
##  9 Fold09 mape    standard       0.635
## 10 Fold10 mape    standard       0.646

Similarly, we can do the same for the mean absolute error, which gives a result in units of the original data (years, in this case) instead of relative units.

tune_rs %>%
  collect_predictions() %>%
  group_by(id) %>%
  mae(year, .pred)
## # A tibble: 10 x 4
##    id     .metric .estimator .estimate
##    <chr>  <chr>   <chr>          <dbl>
##  1 Fold01 mae     standard        12.2
##  2 Fold02 mae     standard        12.0
##  3 Fold03 mae     standard        12.6
##  4 Fold04 mae     standard        11.8
##  5 Fold05 mae     standard        11.6
##  6 Fold06 mae     standard        12.2
##  7 Fold07 mae     standard        12.2
##  8 Fold08 mae     standard        11.7
##  9 Fold09 mae     standard        12.1
## 10 Fold10 mae     standard        12.4

For the full set of regression metric options, see the yardstick documentation.

6.10 The full game: regression

In this chapter, we started from the beginning and then explored both different types of models and different data preprocessing steps. Let’s take a step back and build one final model, using everything we’ve learned. For our final model, let’s again use a lasso regression model, since it performed better than the other options we looked at. We will:

  • train on the same set of cross-validation resamples used throughout this chapter,
  • tune both the lasso regularization parameter and the number of tokens used in the model,
  • include both unigrams and bigrams,
  • remove the Snowball stop word lexicon,
  • choose not to use lemmatization, to demonstrate what is possible for situations when training time makes lemmatization an impractical choice, and
  • finally evaluate on the testing set, which we have not touched at all yet.

6.10.1 Preprocess the data

First, let’s create the data preprocessing recipe. By setting the tokenization options to list(n = 2, n_min = 1), we will include both unigrams and bigrams in our model.

We can remove the Snowball lexicon of stop words from the text with the option stopwords = stopwords::stopwords(source = "snowball"); this will remove the stop words before the n-grams are constructed so that neither the unigrams nor the bigrams will include these stop words.

When we set max_tokens = tune(), we can train multiple models with different numbers of maximum tokens and then compare these models’ performance to choose the best value.

final_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(
    text,
    token = "ngrams",
    options = list(
      n = 2, n_min = 1,
      stopwords = stopwords::stopwords(source = "snowball")
    )
  ) %>%
  step_tokenfilter(text, max_tokens = tune()) %>%
  step_tfidf(text)

final_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Tokenization for text
## Text filtering for text
## Term frequency-inverse document frequency with text

6.10.2 Specify the model

Let’s use the same lasso regression model specification we have used multiple times in this chapter, and set it up here again to remind ourselves. We will tune() the regularization penalty to find the best value.

tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

tune_spec
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

We can combine the preprocessing recipe and the model specification in a tunable workflow.

tune_wf <- workflow() %>%
  add_recipe(final_rec, blueprint = sparse_bp) %>%
  add_model(tune_spec)

tune_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_tokenize()
## ● step_tokenfilter()
## ● step_tfidf()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

6.10.3 Tune the model

Before we tune the model, we need to set up a set of possible parameter values to try.

There are two tunable parameters in this model, the regularization parameter and the maximum number of tokens included in the model.

Let’s include different possible values for each parameter, for a combination of 120 models.

final_grid <- grid_regular(
  penalty(range = c(-4, 0)),
  max_tokens(range = c(1e3, 6e3)),
  levels = c(penalty = 20, max_tokens = 6)
)


final_grid
## # A tibble: 120 x 2
##     penalty max_tokens
##       <dbl>      <int>
##  1 0.0001         1000
##  2 0.000162       1000
##  3 0.000264       1000
##  4 0.000428       1000
##  5 0.000695       1000
##  6 0.00113        1000
##  7 0.00183        1000
##  8 0.00298        1000
##  9 0.00483        1000
## 10 0.00785        1000
## # … with 110 more rows

Throughout this chapter, we have used grid_regular() where we fit a model at every combination of parameters, but if you have a model with many tuning parameters, you may wish to try a space-filling grid instead, such as grid_max_entropy() or grid_latin_hypercube().
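
For example, a space-filling grid over these same two parameters could be set up as in this sketch (the size here is arbitrary); we stick with the regular grid for this chapter.

# Sample 25 parameter combinations with a Latin hypercube design.
set.seed(42)
space_filling_grid <- grid_latin_hypercube(
  penalty(range = c(-4, 0)),
  max_tokens(range = c(1e3, 6e3)),
  size = 25
)

space_filling_grid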

Now it’s time to tune these models. Instead of the default metrics, let’s compute RMSE, mean absolute error (MAE), and mean absolute percent error (MAPE) during tuning so we can compare the models on several scales.

final_rs <- tune_grid(
  tune_wf,
  scotus_folds,
  grid = final_grid,
  metrics = metric_set(rmse, mae, mape)
)

final_rs
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     .metrics           .notes          
##    <list>             <chr>  <list>             <list>          
##  1 <split [6.8K/750]> Fold01 <tibble [360 × 6]> <tibble [0 × 1]>
##  2 <split [6.8K/750]> Fold02 <tibble [360 × 6]> <tibble [0 × 1]>
##  3 <split [6.8K/750]> Fold03 <tibble [360 × 6]> <tibble [0 × 1]>
##  4 <split [6.8K/750]> Fold04 <tibble [360 × 6]> <tibble [0 × 1]>
##  5 <split [6.8K/750]> Fold05 <tibble [360 × 6]> <tibble [0 × 1]>
##  6 <split [6.8K/750]> Fold06 <tibble [360 × 6]> <tibble [0 × 1]>
##  7 <split [6.8K/750]> Fold07 <tibble [360 × 6]> <tibble [0 × 1]>
##  8 <split [6.8K/750]> Fold08 <tibble [360 × 6]> <tibble [0 × 1]>
##  9 <split [6.8K/750]> Fold09 <tibble [360 × 6]> <tibble [0 × 1]>
## 10 <split [6.8K/750]> Fold10 <tibble [360 × 6]> <tibble [0 × 1]>

We trained all these models!

6.10.4 Evaluate the modeling

Now that we have trained models across the grid of possible parameter values, we can compare their performance. Figure 6.8 shows us the relationship between performance (as measured by the metrics we chose), the number of tokens, and regularization.

Pay attention to how increasing regularization affects performance as more tokens are included.

final_rs %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = as.factor(max_tokens))) +
  geom_line(size = 1.5, alpha = 0.5) +
  geom_point(size = 2, alpha = 0.9) +
  facet_wrap(~.metric, scales = "free_y") +
  scale_x_log10() +
  labs(
    color = "Number of tokens",
    title = "Lasso model performance across regularization penalties and number of tokens",
    subtitle = "The best model includes a high number of tokens but also significant regularization"
  )
Regularization becomes more important for performance as more tokens are included in the model

FIGURE 6.8: Regularization becomes more important for performance as more tokens are included in the model

Since this is our final version of this model, we want to choose final parameters and update our model object so we can use it with new data. We have several options for choosing our final parameters, such as selecting the simplest model within some limit around the numerically best result. Let’s stick with choosing the model with the best performance, in terms of one of the metrics we chose.
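As a point of comparison, selecting the simplest candidate within one standard error of the numerically best MAE might look like this sketch; for lasso, a larger penalty corresponds to a simpler model.

# Prefer larger penalties (simpler lasso models) when performance is
# within one standard error of the best MAE
final_rs %>%
  select_by_one_std_err(desc(penalty), metric = "mae")

Here, though, we take the parameters with the numerically best MAE.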

lowest_mae <- final_rs %>%
  select_best("mae")

lowest_mae
## # A tibble: 1 x 3
##   penalty max_tokens .config              
##     <dbl>      <int> <chr>                
## 1   0.144       6000 Preprocessor6_Model16

After we have those parameters, penalty and max_tokens, we can finalize our earlier tunable workflow by updating it with these values.

final_wf <- finalize_workflow(tune_wf, lowest_mae)

final_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_tokenize()
## ● step_tokenfilter()
## ● step_tfidf()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.143844988828766
##   mixture = 1
## 
## Computational engine: glmnet

The final_wf workflow now has finalized values for max_tokens and penalty.

We can now fit this finalized workflow on training data and finally return to our testing data.

Notice that this is the first time we have used our testing data during this entire chapter; we tuned and compared models using resampled datasets instead of touching the testing set.

We can use the function last_fit() to fit our model one last time on our training data and evaluate it on our testing data. We only have to pass this function our finalized model/workflow and our data split.

final_fitted <- last_fit(final_wf, scotus_split)

collect_metrics(final_fitted)
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      11.9   Preprocessor1_Model1
## 2 rsq     standard       0.939 Preprocessor1_Model1

The metrics for the test set look about the same as the resampled training data and indicate we did not overfit during tuning. The RMSE of our final model has improved compared to our earlier models, both because we are combining multiple preprocessing steps and because we have tuned the number of tokens.
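To compare the test set result directly with resampling, we could look up the cross-validated estimate for the parameters we selected; a quick sketch:

# Resampled RMSE for the selected penalty and max_tokens,
# to compare with the test-set RMSE above
final_rs %>%
  collect_metrics() %>%
  filter(.config == lowest_mae$.config, .metric == "rmse")

Similar resampled and test-set values are further evidence that we did not overfit during tuning.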

The output of last_fit() also contains a fitted model (a workflow, to be more specific) that has been trained on the training data. We can use the vip package to understand which variables are most important in the predictions, shown in Figure 6.9.

library(vip)

scotus_imp <- pull_workflow_fit(final_fitted$.workflow[[1]]) %>%
  vi(lambda = lowest_mae$penalty)

scotus_imp %>%
  mutate(
    Sign = case_when(
      Sign == "POS" ~ "Later (after mean year)",
      Sign == "NEG" ~ "Earlier (before mean year)",
    ),
    Importance = abs(Importance),
    Variable = str_remove_all(Variable, "tfidf_text_")
  ) %>%
  group_by(Sign) %>%
  top_n(20, Importance) %>%
  ungroup() %>%
  ggplot(aes(
    x = Importance,
    y = fct_reorder(Variable, Importance),
    fill = Sign
  )) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  facet_wrap(~Sign, scales = "free") +
  labs(
    y = NULL,
    title = "Variable importance for predicting year of Supreme Court opinions",
    subtitle = "These features are the most importance in predicting the year of an opinion"
  )
Some words or bigrams increase a Supreme Court opinion's probability of being written later (more recently) while some increase its probability of being written earlier

FIGURE 6.9: Some words or bigrams increase a Supreme Court opinion’s probability of being written later (more recently) while some increase its probability of being written earlier

The tokens (unigrams or bigrams) that contribute in the positive direction, like "see supra" and "scalia", are associated with higher, later years, while those that contribute in the negative direction, like "years ago" and "founded", are associated with lower, earlier years for these Supreme Court opinions.

Some of these features are unigrams and some are bigrams, but notice that none include stop words from the Snowball lexicon, because we removed them before tokenizing. The bigram "opinion court" largely represents the phrase "opinion of the court".
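We can check this by applying the same tokenization options from our recipe to that phrase; this is a quick illustration only, not part of the modeling pipeline.

# With Snowball stop words removed, "opinion of the court" yields the
# unigrams "opinion" and "court" plus the bigram "opinion court"
tokenizers::tokenize_ngrams(
  "opinion of the court",
  n = 2, n_min = 1,
  stopwords = stopwords::stopwords(source = "snowball")
)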

We can also examine how the true and predicted years compare for the testing set. Figure 6.10 shows us that, like for our earlier models on the resampled training data, we can predict the year of Supreme Court opinions for the testing data starting from about 1850. Predictions are less reliable before that year. This is an example of finding different error rates across sub-groups of observations, like we discussed in the foreword to these chapters; these differences can lead to unfairness and algorithmic bias when models are applied in the real world.

final_fitted %>%
  collect_predictions() %>%
  ggplot(aes(year, .pred)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Truth",
    y = "Predicted year",
    title = "Predicted and true years for the testing set of Supreme Court opinions",
    subtitle = "For the testing set, predictions are more reliable after 1850"
  )
Predicted and true years from a lasso regression model with bigrams and unigrams, and stop words removed

FIGURE 6.10: Predicted and true years from a lasso regression model with bigrams and unigrams, and stop words removed
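One way to quantify this difference in reliability is to compute RMSE separately for opinions before and after 1850; this is a sketch using the test-set predictions, with the cutoff chosen only to match the pattern in Figure 6.10.

# Test-set RMSE by era, showing how much less reliable predictions are
# for the earliest opinions
final_fitted %>%
  collect_predictions() %>%
  mutate(era = if_else(year < 1850, "before 1850", "1850 and later")) %>%
  group_by(era) %>%
  rmse(truth = year, estimate = .pred)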

6.11 Summary

You can use regression modeling to predict a continuous variable from a dataset, including a text dataset. Regularized linear models, such as lasso, often work well for text datasets, while tree-based models such as random forest often behave poorly. There are many possible preprocessing steps for text data, from removing stop words to n-gram tokenization strategies to lemmatization, that may improve your model. Resampled datasets and careful use of metrics allow you to make good choices among these possible options, given your own concerns and priorities.

6.11.1 In this chapter, you learned:

  • what kind of quantities can be modeled using regression
  • to tune a lasso regression model using resampled data
  • how to compare different model types
  • about measuring the impact of n-gram tokenization on models
  • how to implement lemmatization and stop word removal with text models
  • about performance metrics for regression models

References

Bates, Douglas, and Martin Maechler. 2019. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.

Benoit, Kenneth, and Akitaka Matsuo. 2019. Spacyr: Wrapper to the ’spaCy’ ’NLP’ Library. https://CRAN.R-project.org/package=spacyr.

Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. 2008. “An Empirical Evaluation of Supervised Learning in High Dimensions.” In Proceedings of the 25th International Conference on Machine Learning, 96–103.

Hvitfeldt, Emil. 2019b. Scotus: Collection of Supreme Court of the United States Opinions. https://github.com/EmilHvitfeldt/scotus.

Olson, Randal S., William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2017. “Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems.” http://arxiv.org/abs/1708.05070.

Tang, Cheng, Damien Garreau, and Ulrike von Luxburg. 2018. “When Do Random Forests Fail?” In Advances in Neural Information Processing Systems, 2983–93.


  1. The random forest implementation in the ranger package, demonstrated earlier in this chapter, does not handle special characters in column names well.↩︎

  2. The English word “raccoon” derives from an Algonquin word meaning “scratches with his hands!”↩︎