Chapter 9 Regression

In this chapter, we will predict continuous values, much like we did in Chapter 6, but we will use deep learning methods instead of methods such as regularized linear regression. Let’s consider a dataset of press releases from the United States Department of Justice (DOJ), which they release on their website.

library(tidyverse)

doj_press <- read_csv("data/press_releases.csv.gz")
doj_press
## # A tibble: 13,087 x 4
##    date       agency           title                   contents                 
##    <date>     <chr>            <chr>                   <chr>                    
##  1 2014-10-01 National Securi… "Convicted Bomb Plotte… "PORTLAND, Oregon. – Moh…
##  2 2012-07-25 Environment and… "$1 Million in Restitu… "WASHINGTON – North Caro…
##  3 2011-08-03 Environment and… "$1 Million Settlement… "BOSTON– A $1-million se…
##  4 2010-01-08 Environment and… "10 Las Vegas Men Indi… "WASHINGTON—A federal gr…
##  5 2018-07-09 Environment and… "$100 Million Settleme… "The U.S. Department of …
##  6 2015-07-22 USAO - Puerto R… "105 Individuals Indic… "A nine count federal in…
##  7 2014-01-10 Civil Rights Di… "12th Former Officer a… "Michael Morgan, formerl…
##  8 2014-12-17 Civil Division   "14 Indicted in Connec… "A 131-count criminal in…
##  9 2014-08-12 Criminal Divisi… "14 Individuals Charge… "Fourteen  individuals w…
## 10 2015-01-23 Criminal Divisi… "15th Member of Washin… "Defendant Admits To Tar…
## # … with 13,077 more rows

We know the date that each of these press releases was published, and predicting this date from other characteristics of the press releases, such as the main agency within the DOJ involved, the title, and the main contents of the press release, is a regression problem.

library(lubridate)

doj_press %>%
  count(month = floor_date(date, unit = "months"), name = "releases") %>%
  ggplot(aes(month, releases)) +
  geom_area(alpha = 0.8) +
  geom_smooth() +
  labs(x = NULL, y = "Releases per month")
Distribution of Department of Justice press releases over time

FIGURE 9.1: Distribution of Department of Justice press releases over time

This dataset includes all press releases from the DOJ from the beginning of 2009 through July 2018. There is some month-to-month variation and an overall increase in releases, but there is good coverage over the timeframe for which we would like to build a model.

There are 96 distinct main agencies associated with these releases, but some press releases have no agency associated with them. A few agencies, such as the Criminal Division, Civil Right Division, and Tax Division, account for many more press releases than other agencies.

doj_press %>%
  count(agency) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(n, fct_reorder(agency, n))) +
  geom_col() +
  labs(x = "Number of press releases", y = NULL)
Main agency associated with Department of Justice press releases

FIGURE 9.2: Main agency associated with Department of Justice press releases

The DOJ press releases are relatively long documents; we will take this into consideration as we build neural network architectures for modeling.

library(tidytext)
doj_press %>%
  unnest_tokens(word, contents) %>%
  count(title) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 25, alpha = 0.8) +
  scale_x_log10(labels = scales::comma_format()) +
  labs(
    x = "Number of words per press release",
    y = "Number of press releases"
  )
Distribution of word count for Department of Justice press releases

FIGURE 9.3: Distribution of word count for Department of Justice press releases

Compared to the documents we built deep learning models for in Chapter 8, these press releases are long, with a median character count of 3,239 for the contents of the press releases. We can use deep learning models to model these longer sequences.

Some examples, such as this press release from the end of 2016, are quite short:

Deputy Attorney General Sally Q. Yates released the following statement after President Obama granted commutation of sentence to 153 individuals: “Today, another 153 individuals were granted commutations by the President. Over the last eight years, President Obama has given a second chance to over 1,100 inmates who have paid their debt to society. Our work is ongoing and we look forward to additional announcements from the President before the end of his term.”

9.1 A first regression model

As we walk through building a deep learning model, notice which steps are different and which steps are the same now that we use a neural network architecture.

Much like all our previous modeling, our first step is to split our data into training and testing sets. We will still use our training set to build models and save the testing set for a final estimate of how our model will perform on new data. It is very easy to overfit deep learning models, so an unbiased estimate of future performance from a test set is more important than ever.

We use initial_split() to define the training/testing split, after removing examples that have a title but no contents in the press release. We will focus mainly on modeling the contents in this chapter, although the title is also text that could be handled in a deep learning model. Almost all of the press releases have character counts between 500 and 50,000, but let’s exclude the ones that don’t because they will represent a challenge for the preprocessing required for deep learning models.

library(tidymodels)
library(lubridate)
set.seed(1234)
doj_split <- doj_press %>%
  filter(
    !is.na(contents),
    nchar(contents) > 5e2, nchar(contents) < 5e4
  ) %>%
  mutate(date = as.numeric(date) / 1e4) %>% ## can convert back with origin = "1970-01-01"
  initial_split()

doj_train <- training(doj_split)
doj_test <- testing(doj_split)

There are 9,784 press releases in the training set and 3,261 in the testing set.

We converted the date variable to its underlying numeric representation so we can more easily train any kind of regression model we want. To go from an object that has R’s date type to a numeric, use as.numeric(date). To convert back from this numeric representation to a date, use as.Date(date, origin = “1970-01-01”). That special date is the “origin” (like zero) for the numbering system used by R’s date types.

Notice that we also scaled (divided) the date outcome by a constant factor so all the values are closer to one. Deep learning models sometimes do not perform well when dealing with very large numeric values.

9.1.1 Preprocessing for deep learning

The preprocessing needed for deep learning network architectures is somewhat different than for the models we used in Chapters 7 and 6. The first step is still to tokenize the text, as described in Chapter 2. After we tokenize, we put a filter on how many words we’ll keep in the analysis; step_tokenfilter() keeps the top tokens based on frequency in this dataset.

library(textrecipes)

max_words <- 2e4
max_length <- 1e3

doj_rec <- recipe(~contents, data = doj_train) %>%
  step_tokenize(contents) %>%
  step_tokenfilter(contents, max_tokens = max_words) %>%
  step_sequence_onehot(contents,
    sequence_length = max_length,
    truncating = "post", padding = "post"
  )

doj_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          1
## 
## Operations:
## 
## Tokenization for contents
## Text filtering for contents
## Sequence 1 hot encoding for contents

After tokenizing, the preprocessing is different. We use step_sequence_onehot() to encode the sequences of words with integers representing each token in the vocabulary of 20,000 words. This is different than the representations we used in Chapters 7 and 6, mainly because all the information about word sequence is encoded in this representation.

Using step_sequence_onehot() to preprocess text data records and encodes sequence information, unlike the document-term matrix and/or bag-of-tokens approaches we used in Chapters (???)(mlclassification) and (???)(mlregression).

The DOJ press releases have a wide spread in document length, and we have to make a decision about how long of a sequence to include in our preprocessing.

  • If we choose the longest document, all the shorter documents will be “padded” with zeroes indicating no words or tokens in those empty spaces and our feature space will grow very large.
  • If we choose the shortest document as our sequence length, our feature space will be more manageable but all the longer documents will get cut off and we won’t include any of that information in our model.

In a situation like this, it’s can often work well to choose a medium sequence length, like 1000 words in this case, that involves truncating the longest documents and padding the shortest documents. We also center and scale the input data with step_normalize() because neural networks tend to work better this way.

In previous chapters, we used a preprocessing recipe like doj_rec in a tidymodels workflow but for our neural network models, we don’t have that option. We need to be able to work with the keras modeling functions directly because of the flexible options needed to build many neural network architectures. We need to execute our preprocessing recipe, using first prep() and then bake().

When we prep() a recipe, we compute or estimate statistics from the training set; the output of prep() is a recipe. When we bake() a recipe, we apply the preprocessing to a dataset, either the training set that we started with or another set like the testing data or new data. The output of bake() is a dataset like a tibble or a matrix.

We could have applied these functions to any preprocessing recipes in previous chapters, but we didn’t need to because our modeling workflows automated these steps.

doj_prep <- prep(doj_rec)
doj_matrix <- bake(doj_prep, new_data = NULL, composition = "matrix")

dim(doj_matrix)
## [1] 9784 1000

Here we use composition = "matrix" because the keras modeling functions operate on matrices.

9.1.2 Recurrent neural network

library(keras)

rnn_mod <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 64) %>%
  layer_simple_rnn(units = 16) %>%
  layer_dense(units = 1)

rnn_mod
## Model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## embedding (Embedding)               (None, None, 64)                1280064     
## ________________________________________________________________________________
## simple_rnn (SimpleRNN)              (None, 16)                      1296        
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       17          
## ================================================================================
## Total params: 1,281,377
## Trainable params: 1,281,377
## Non-trainable params: 0
## ________________________________________________________________________________

Because we are training a regression model, there is no activation function for the last layer; we want to fit and predict to arbitrary values for this numeric representation of date.

rnn_mod %>%
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mean_squared_error")
  )
history <- rnn_mod %>%
  fit(
    doj_matrix,
    doj_train$date,
    epochs = 10,
    batch_size = 128,
    validation_split = 0.2
  )
rnn_mod %>%
  evaluate(
    bake(doj_prep, doj_test, composition = "matrix"),
    doj_test$date
  )

9.1.3 Evaluation

9.2 Preprocessing

9.3 putting your layers together

Express difference from classification model.

9.4 Model tuning

9.5 Look at different deep learning architecture

9.6 Full game

All bells and whistles.

knitr::knit_exit()