Chapter 8 Classification

In this chapter, we will predict binary values, much like we did in Chapter 7, but we will use deep learning methods instead. We will be using a dataset of fundraising campaigns from Kickstarter.

library(tidyverse)

kickstarter <- read_csv("data/kickstarter.csv.gz")
kickstarter
## # A tibble: 269,790 x 3
##    blurb                                                        state created_at
##    <chr>                                                        <dbl> <date>    
##  1 Exploring paint and its place in a digital world.                0 2015-03-17
##  2 Mike Fassio wants a side-by-side photo of me and Hazel eati…     0 2014-07-11
##  3 I need your help to get a nice graphics tablet and Photosho…     0 2014-07-30
##  4 I want to create a Nature Photograph Series of photos of wi…     0 2015-05-08
##  5 I want to bring colour to the world in my own artistic skil…     0 2015-02-01
##  6 We start from some lovely pictures made by us and we decide…     0 2015-11-18
##  7 Help me raise money to get a drawing tablet                      0 2015-04-03
##  8 I would like to share my art with the world and to do that …     0 2014-10-15
##  9 Post Card don’t set out to simply decorate stories. Our goa…     0 2015-06-25
## 10 My name is Siu Lon Liu and I am an illustrator seeking fund…     0 2014-07-19
## # … with 269,780 more rows

We are working with fairly short texts in this dataset; each blurb is less than a couple of hundred characters long. We can start by looking at the distribution of character counts.

kickstarter %>%
  ggplot(aes(nchar(blurb))) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Number of characters per campaign blurb",
    y = "Number of campaign blurbs"
  )

FIGURE 8.1: Distribution of character count for Kickstarter campaign blurbs

The distribution is left-skewed, which is to be expected: since you don't have much space to make your impression, most people choose to use most of it. There is one odd thing happening in this chart, though: a sharp drop somewhere between 130 and 140 characters. Let us investigate to see if we can find the reason.

We can use count() to find the most common blurb length.

kickstarter %>%
  count(nchar(blurb), sort = TRUE)
## # A tibble: 151 x 2
##    `nchar(blurb)`     n
##             <int> <int>
##  1            135 26827
##  2            134 18726
##  3            133 14913
##  4            132 13559
##  5            131 11322
##  6            130 10083
##  7            129  8786
##  8            128  7874
##  9            127  7239
## 10            126  6590
## # … with 141 more rows

It appears to be 135, which in and of itself doesn't tell us much. It might be a glitch in the data collection process. Let us look with our own eyes at what happens around this cutoff point. We can use slice_sample() to draw a random sample of the data.

We start by looking at blurbs with exactly 135 characters, so we can identify whether these blurbs were cut short at 135 characters.

set.seed(1)
kickstarter %>%
  filter(nchar(blurb) == 135) %>%
  slice_sample(n = 5) %>%
  pull(blurb)
## [1] "A science fiction/drama about a young man and woman encountering beings not of this earth. Armed with only their minds to confront this"
## [2] "No, not my virginity. That was taken by a girl named Ramona the night of my senior prom. I'm talking about my novel, THE USE OF REGRET."
## [3] "In a city where the sun has stopped rising, the music never stops. Now only a man and his guitar can free the people from the Red King."
## [4] "First Interfaith & Community FM Radio Station needs transmitter in Menifee, CA Programs online, too CLICK PHOTO ABOVE FOR OUR CAT VIDEO"
## [5] "This documentary asks if the twenty-four hour news cycle has altered people's opinions of one another. We explore unity in one another."

That doesn't appear to be the case, as all of these blurbs read coherently and some of them even end with a period. Let us now look at blurbs with more than 135 characters to see if these are any different.

set.seed(1)
kickstarter %>%
  filter(nchar(blurb) > 135) %>%
  slice_sample(n = 5) %>%
  pull(blurb)
## [1] "This is a puzzle game for the Atari 2600. The unique thing about this is that (some) of the cartridge cases will be made out of real wood, hand carved"
## [2] "Art supplies for 10 girls on the east side of Detroit to make drawings of their neighborhood, which is also home to LOVELAND's Plymouth microhood"     
## [3] "Help us make a video for 'Never', one of the most popular songs on Songs To Wear Pants To and the lead single from Your Heart's upcoming album Autumn."
## [4] "Pyramid Cocoon is an interactive sculpture to be installed during the Burning Man Festival 2010. Users can rest, commune or cocoon in the piece"       
## [5] "Back us to own, wear, or see a show of great student art we've collected from Artloop partner schools in NYC. The $ goes right back to art programs!"

All of these blurbs also look fine, so it doesn't look like a data collection issue. The kickstarter dataset also includes a created_at variable; let us see what we can learn with that additional information.

Below is a heatmap of the lengths of blurbs and the time the campaign was posted.

kickstarter %>%
  ggplot(aes(created_at, nchar(blurb))) +
  geom_bin2d() +
  labs(
    x = NULL,
    y = "Number of characters per campaign blurb"
  )

FIGURE 8.2: Distribution of character count for Kickstarter campaign blurbs over time

We see a clear trend here: it appears that at the end of 2010 there was a policy change that shortened the maximum blurb length from 150 characters to 135 characters.

kickstarter %>%
  filter(nchar(blurb) > 135) %>%
  summarise(max(created_at))
## # A tibble: 1 x 1
##   `max(created_at)`
##   <date>           
## 1 2010-10-20

We can't tell for sure whether the change happened on 2010-10-20, but that is the last day a campaign was launched with a blurb of more than 135 characters.
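To back up the idea that the change happened toward the end of 2010, we can count how many long blurbs were launched in each year. This is just a quick check (output not shown), using format() to pull the year out of the created_at dates.

kickstarter %>%
  filter(nchar(blurb) > 135) %>%
  count(year = format(created_at, "%Y"))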

8.1 A first classification model

Much like in our previous modeling, our first step is to split the data into training and testing sets. We will still use the training set to build models and save the testing set for a final estimate of how our model will perform on new data. It is very easy to overfit deep learning models, so an unbiased estimate of future performance from a test set is more important than ever. This data will also be challenging to work with, since the blurbs give us relatively little information.

We use initial_split() to define the training/testing split. We will focus on modeling the blurb alone in this chapter, and we will restrict the data to blurbs with at least 15 characters, since the shortest blurbs tend to be uninformative single words.

library(tidymodels)
set.seed(1234)
kickstarter_split <- kickstarter %>%
  filter(nchar(blurb) >= 15) %>%
  initial_split()

kickstarter_train <- training(kickstarter_split)
kickstarter_test <- testing(kickstarter_split)

There are 202,093 campaign blurbs in the training set and 67,364 in the testing set.

8.1.1 Preprocessing for deep learning

The preprocessing we will do requires a hyperparameter denoting the length of the sequences we would like to include. We need to select this value so that we don't overshoot and introduce a lot of padded zeroes, which would make the model harder to train, while also avoiding sequences that are so short that they truncate useful text.

We can use the count_words() function from the tokenizers package to calculate the number of words in each blurb and plot the distribution. Notice how we are only using the training dataset, to avoid data leakage when selecting this value.

kickstarter_train %>%
  mutate(n_words = tokenizers::count_words(blurb)) %>%
  ggplot(aes(n_words)) +
  geom_bar() +
  labs(
    x = "Number of words per campaign blurb",
    y = "Number of campaign blurbs"
  )

FIGURE 8.3: Distribution of word count for Kickstarter campaign blurbs

Given that we don't have many words to begin with, it makes sense to err on the side of longer sequences, since we don't want to lose valuable data. A sequence length of 30 looks like a reasonable cutoff point.
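To sanity-check this choice, we can compute what proportion of the training blurbs would fit within 30 tokens; a quick check (output not shown), reusing the same count_words() helper as above.

kickstarter_train %>%
  mutate(n_words = tokenizers::count_words(blurb)) %>%
  summarise(prop_within_30 = mean(n_words <= 30))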

library(textrecipes)

max_words <- 20000
max_length <- 30

prepped_recipe <- recipe(~blurb, data = kickstarter_train) %>%
  step_tokenize(blurb) %>%
  step_tokenfilter(blurb, max_tokens = max_words) %>%
  step_sequence_onehot(blurb, sequence_length = max_length) %>%
  prep()

prepped_training <- prepped_recipe %>%
  bake(new_data = NULL, composition = "matrix")

8.1.2 One-hot sequence embedding of text

We have used step_sequence_onehot() to transform the tokens into a numerical format. The main difference here is that this format takes the order of the tokens into account, unlike step_tf() and step_tfidf(), which do not; step_tf() and step_tfidf() are called bag-of-words methods for this reason. Let us take a closer look at how step_sequence_onehot() works and how its parameters change the output.

When we use step_sequence_onehot(), two things happen. First, each word is assigned an integer index; you can think of this as a key-value pairing of the vocabulary. Next, the sequence of tokens is replaced with the corresponding indices, and it is this sequence of integers that makes up the final numerical representation. To illustrate, here is a small example:

small_data <- tibble(
  text = c(
    "Adventure Dice Game",
    "Spooky Dice Game",
    "Illustrated Book of Monsters",
    "Monsters, Ghosts, Goblins, Me, Myself and I"
  )
)

small_spec <- recipe(~text, data = small_data) %>%
  step_tokenize(text) %>%
  step_sequence_onehot(text, sequence_length = 6, prefix = "") %>%
  prep()

Once we have the prep()ed recipe, we can tidy() it to extract the vocabulary, which is represented in the vocabulary and token columns.

small_spec %>%
  tidy(2)
## # A tibble: 14 x 4
##    terms vocabulary token       id                   
##    <chr>      <int> <chr>       <chr>                
##  1 text           1 adventure   sequence_onehot_9SVGf
##  2 text           2 and         sequence_onehot_9SVGf
##  3 text           3 book        sequence_onehot_9SVGf
##  4 text           4 dice        sequence_onehot_9SVGf
##  5 text           5 game        sequence_onehot_9SVGf
##  6 text           6 ghosts      sequence_onehot_9SVGf
##  7 text           7 goblins     sequence_onehot_9SVGf
##  8 text           8 i           sequence_onehot_9SVGf
##  9 text           9 illustrated sequence_onehot_9SVGf
## 10 text          10 me          sequence_onehot_9SVGf
## 11 text          11 monsters    sequence_onehot_9SVGf
## 12 text          12 myself      sequence_onehot_9SVGf
## 13 text          13 of          sequence_onehot_9SVGf
## 14 text          14 spooky      sequence_onehot_9SVGf

The terms column refers to the column we have applied step_sequence_onehot() to, and id is its unique identifier. Note that textrecipes allows step_sequence_onehot() to be applied to multiple text variables independently, and each will have its own vocabulary.
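For example, here is a small sketch of a recipe with two made-up text columns, title and description (these are not part of the Kickstarter data); each column gets its own vocabulary and its own set of output columns. Output is not shown.

# Hypothetical data with two text columns; each column is tokenized and
# one-hot sequence encoded independently
two_column_data <- tibble(
  title = c("Spooky Dice Game", "Illustrated Book of Monsters"),
  description = c("A game of dice", "A book full of monsters")
)

recipe(~ title + description, data = two_column_data) %>%
  step_tokenize(title, description) %>%
  step_sequence_onehot(title, description, sequence_length = 4) %>%
  prep() %>%
  juice(composition = "matrix")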

If we take a look at the matrix produced from small_spec, we have 1 row per observation. The first row starts with some padded zeroes and then continues with 1, 4, 5, which when matched with the vocabulary reconstructs the original sentence, "adventure dice game".

small_spec %>%
  juice(composition = "matrix")
##      _text_1 _text_2 _text_3 _text_4 _text_5 _text_6
## [1,]       0       0       0       1       4       5
## [2,]       0       0       0      14       4       5
## [3,]       0       0       9       3      13      11
## [4,]       6       7      10      12       2       8

But wait: the 4th row should have started with 11, since that sentence starts with "Monsters", yet the first number is 6. This happens because the sentence is too long to fit inside the specified length. This leads us to ask 3 questions before using step_sequence_onehot():

  1. How long should the output sequence be?
  2. What happens to too long sequences?
  3. What happens to too short sequences?

Choosing the right length is a balancing act. You want the length to be long enough that you don't truncate too much of your text data, but still short enough to keep the size of the data down and to avoid excessive padding. Truncating too much, producing overly large output, and excessive padding all lead to worse model performance. This parameter is controlled by the sequence_length argument in step_sequence_onehot(). If a sequence is too long, then we need to truncate it; this can be done by removing values from the beginning ("pre") or the end ("post") of the sequence. This choice is mostly influenced by the data, and you need to evaluate where most of the extractable information in the text is located. News articles typically start with the main points and then go into detail, so if your goal is to detect the broad category you probably want to keep the beginning of the texts, whereas if you are working with speeches or conversational text, you might find that the last thing said carries more information, which would lead us to truncate from the beginning. Lastly, we need to decide how the padding should be done if the sequence is too short. Pre-padding tends to be more popular, especially when working with RNN and LSTM models, since post-padding could result in the hidden states getting flushed out by the zeroes before the model gets to the text itself.

step_sequence_onehot() defaults to sequence_length = 100, padding = "pre", and truncating = "pre". If we change the truncation to happen at the end:

recipe(~text, data = small_data) %>%
  step_tokenize(text) %>%
  step_sequence_onehot(text,
    sequence_length = 6, prefix = "",
    padding = "pre", truncating = "post"
  ) %>%
  prep() %>%
  juice(composition = "matrix")
##      _text_1 _text_2 _text_3 _text_4 _text_5 _text_6
## [1,]       0       0       0       1       4       5
## [2,]       0       0       0      14       4       5
## [3,]       0       0       9       3      13      11
## [4,]      11       6       7      10      12       2

then we see the 11 at the beginning of the last row, representing "monsters". The starting points are still not aligned, since we are still padding on the left side. We can left-align all the sequences by setting padding = "post".

recipe(~text, data = small_data) %>%
  step_tokenize(text) %>%
  step_sequence_onehot(text,
    sequence_length = 6, prefix = "",
    padding = "post", truncating = "post"
  ) %>%
  prep() %>%
  juice(composition = "matrix")
##      _text_1 _text_2 _text_3 _text_4 _text_5 _text_6
## [1,]       1       4       5       0       0       0
## [2,]      14       4       5       0       0       0
## [3,]       9       3      13      11       0       0
## [4,]      11       6       7      10      12       2

Now the first token of each sequence is neatly aligned in the first column.

8.1.3 Simple flattened dense network

We will start with a model that embeds the tokens in a sequence of vectors, flattens them, and then trains a dense layer on top.

library(keras)

dense_model <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = max_words + 1,
    output_dim = 12,
    input_length = max_length
  ) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

dense_model
## Model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## embedding (Embedding)               (None, 30, 12)                  240012      
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 360)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 32)                      11552       
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       33          
## ================================================================================
## Total params: 251,597
## Trainable params: 251,597
## Non-trainable params: 0
## ________________________________________________________________________________

Let us step through this model specification one layer at a time. We start the keras model by using keras_model_sequential() to indicate that we want to compose a linear stack of layers. Our first layer is an embedding layer via layer_embedding(). This layer is equipped to handle the preprocessed data we have in prepped_training. It takes each observation/row in prepped_training and embeds each token into an embedding vector, turning each observation into a (sequence_length x embedding_dim) matrix, which with our settings is a (30 x 12) matrix, and the whole training set into a (number of observations x sequence_length x embedding_dim) tensor. The layer_flatten() layer that follows takes the 2-dimensional tensor for each observation and flattens it down into 1 dimension, creating a vector of length 30 * 12 = 360 for each observation. Lastly, we have 2 densely connected layers, with the final layer using a sigmoid activation function to give us a number between 0 and 1.
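We can verify the parameter counts reported in the model summary with a little arithmetic: the embedding layer holds one 12-dimensional vector per vocabulary entry (plus one extra index), and the first dense layer connects the 360 flattened values to 32 units, plus one bias per unit.

(max_words + 1) * 12        # embedding layer parameters
## [1] 240012
max_length * 12 * 32 + 32   # first dense layer: weights plus biases
## [1] 11552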

Now that we have specified the architecture of the model, we still have a couple of things to add before we can fit it to the data. A keras model requires an optimizer and a loss function to be compiled. When the neural network finishes passing a batch of data through the network, it needs a way to use the difference between the predicted values and the true values to update the network's weights. The algorithm that determines those updates is known as the optimization algorithm. keras comes pre-loaded with many optimizers, and you can even create custom optimizers if what you need isn't on the list. We will start by using the rmsprop optimizer.

An optimizer can be set either with the name of the optimizer as a character string or by supplying the corresponding optimizer function, such as optimizer_rmsprop(). If you use the function form, you can specify parameters for the optimizer.
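For example, a minimal sketch of the function form, setting the learning rate explicitly (in older versions of keras the argument is named lr rather than learning_rate):

# The same rmsprop optimizer, specified via its function so we can set parameters
optimizer_rmsprop(learning_rate = 0.001)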

During training, we need a quantity that we want to minimize; this is the loss function, and keras comes pre-loaded with many loss functions. A loss function takes two values, typically the true value and the predicted value, and returns a measure of how close they are. Since we are working on a binary classification task and the final layer of the network returns a probability, binary cross-entropy is an appropriate loss function. Binary cross-entropy does well at dealing with probabilities because it measures the "distance" between probability distributions, which in our case is between the ground-truth distribution and the predictions.
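To make this concrete, here is a quick hand computation of binary cross-entropy for a single observation: the loss is small when the predicted probability is close to the true label and grows quickly as it moves away.

y_true <- 1
y_pred <- 0.8
-(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
## [1] 0.2231436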

We can also add any number of metrics to be calculated and reported during training. These metrics do not affect the training loop, which is controlled by the optimizer and loss function; a metric's job is to report back a single number that informs the user how well the model is performing. We will select accuracy as our metric for now. We can now set these 3 options, optimizer, loss, and metrics, using the compile() function.

dense_model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

Notice how the compile() function modifies the network in place. This is different from what is conventionally done in R, where a new network object would have been returned.

Finally, we can fit the model. When we fit() a keras model, we need to supply it with the data to train on: a matrix of predictors x and a numeric vector of labels y. This is sufficient information to start training, but we will specify a couple more arguments to get better control of the training loop. First, we set the number of observations to pass through at a time with batch_size, and we set epochs = 20 to tell the model to pass all the data through the training loop 20 times. Lastly, we set validation_split = 0.2 to specify an internal validation split used when the metrics are calculated.

dense_history <- dense_model %>% fit(
  x = prepped_training,
  y = kickstarter_train$state,
  batch_size = 512,
  epochs = 20,
  validation_split = 0.2
)

We can visualize the results of the training loop by plot()ing the dense_history.

plot(dense_history)

FIGURE: Training and validation metrics for dense network

Now that we have a fitted model, we can apply it to our testing dataset to see how well it performs on data it hasn't seen.

dense_model %>%
  evaluate(
    bake(prepped_recipe, kickstarter_test, composition = "matrix"),
    kickstarter_test$state
  )
## $loss
## [1] 0.9574677
## 
## $acc
## [1] 0.8063951

We see that the accuracy closely resembles the validation accuracy from the training loop, suggesting that we didn't overfit our model.

8.2 Using pre-trained word embeddings

In the last section we included an embedding layer and let the model train the embedding along with everything else. This is not the only way to handle this task. In Chapter 5 we looked at how embeddings are created and how they are used. Instead of having the embedding layer start at random and be trained alongside the other parameters, let us try supplying the embedding ourselves.

We start by getting a pre-trained embedding. The GloVe embedding that we used in Section 5.3 will work for now. Setting dimensions = 50 and only selecting the first 12 dimensions will make it easier for us to compare with the previous models.

library(textdata)

glove6b <- embedding_glove6b(dimensions = 50) %>% select(1:13)
glove6b
## # A tibble: 400,000 x 13
##    token      d1      d2      d3      d4      d5      d6      d7      d8      d9
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 "the" -0.0382 -0.245   0.728  -0.400   0.0832  0.0440 -0.391   0.334  -0.575 
##  2 ","   -0.108   0.111   0.598  -0.544   0.674   0.107   0.0389  0.355   0.0635
##  3 "."   -0.340   0.209   0.463  -0.648  -0.384   0.0380  0.171   0.160   0.466 
##  4 "of"  -0.153  -0.243   0.898   0.170   0.535   0.488  -0.588  -0.180  -1.36  
##  5 "to"  -0.190   0.0500  0.191  -0.0492 -0.0897  0.210  -0.550   0.0984 -0.201 
##  6 "and" -0.0720  0.231   0.0237 -0.506   0.339   0.196  -0.329   0.184  -0.181 
##  7 "in"   0.0857 -0.222   0.166   0.134   0.382   0.354   0.0129  0.225  -0.438 
##  8 "a"   -0.271   0.0440 -0.0203 -0.174   0.644   0.712   0.355   0.471  -0.296 
##  9 "\""  -0.305  -0.236   0.176  -0.729  -0.283  -0.256   0.266   0.0253 -0.0748
## 10 "'s"   0.589  -0.202   0.735  -0.683  -0.197  -0.180  -0.392   0.342  -0.606 
## # … with 399,990 more rows, and 3 more variables: d10 <dbl>, d11 <dbl>,
## #   d12 <dbl>

The embedding_glove6b() function returns a tibble, which isn't the format keras expects. Also, take note of how many rows are present in this embedding: far more than the vocabulary of the trained recipe. That vocabulary can be extracted from the trained recipe using tidy(). First, we apply tidy() to prepped_recipe to get the list of steps that the recipe contains.

tidy(prepped_recipe)
## # A tibble: 3 x 6
##   number operation type            trained skip  id                   
##    <int> <chr>     <chr>           <lgl>   <lgl> <chr>                
## 1      1 step      tokenize        TRUE    FALSE tokenize_eDrDa       
## 2      2 step      tokenfilter     TRUE    FALSE tokenfilter_zDVeF    
## 3      3 step      sequence_onehot TRUE    FALSE sequence_onehot_TaBPG

We see that the 3rd step is the sequence_onehot step, so by setting number = 3 we can extract the vocabulary of the transformation.

tidy(prepped_recipe, number = 3)
## # A tibble: 20,000 x 4
##    terms vocabulary token id                   
##    <chr>      <int> <chr> <chr>                
##  1 blurb          1 0     sequence_onehot_TaBPG
##  2 blurb          2 00    sequence_onehot_TaBPG
##  3 blurb          3 000   sequence_onehot_TaBPG
##  4 blurb          4 00pm  sequence_onehot_TaBPG
##  5 blurb          5 01    sequence_onehot_TaBPG
##  6 blurb          6 02    sequence_onehot_TaBPG
##  7 blurb          7 03    sequence_onehot_TaBPG
##  8 blurb          8 05    sequence_onehot_TaBPG
##  9 blurb          9 06    sequence_onehot_TaBPG
## 10 blurb         10 07    sequence_onehot_TaBPG
## # … with 19,990 more rows

This list of tokens can then be left_join()ed to the glove6b embedding tibble to only keep the tokens of interest. Any token from the vocabulary not found in glove6b is replaced with 0 using mutate_all() and replace_na(). The result is turned into a matrix, and a row of zeroes is added at the top of the matrix to account for the out-of-vocabulary words.

glove6b_matrix <- tidy(prepped_recipe, 3) %>%
  select(token) %>%
  left_join(glove6b, by = "token") %>%
  mutate_all(replace_na, 0) %>%
  select(-token) %>%
  as.matrix() %>%
  rbind(0, .)

The way the model is constructed remains as unchanged as possible. We make sure that the output_dim argument is set equal to ncol(glove6b_matrix) so that all the dimensions line up nicely. Everything else stays the same.

dense_model_pte <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = max_words + 1,
    output_dim = ncol(glove6b_matrix),
    input_length = max_length
  ) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

Now we use get_layer() to access the first layer, which is the embedding layer, set the weights with set_weights(), and lastly freeze the weights with freeze_weights(). Freezing the weights stops them from being updated during the training loop.

dense_model_pte %>%
  get_layer(index = 1) %>%
  set_weights(list(glove6b_matrix)) %>%
  freeze_weights()

Now we will compile and fit the model just like the last one we looked at.

dense_model_pte %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

dense_pte_history <- dense_model_pte %>% fit(
  x = prepped_training,
  y = kickstarter_train$state,
  batch_size = 512,
  epochs = 20,
  validation_split = 0.2
)

This model is not performing as well as the previous model, and the evaluation on the testing set isn't much better.

dense_model_pte %>%
  evaluate(
    bake(prepped_recipe, kickstarter_test, composition = "matrix"),
    kickstarter_test$state
  )

Why is this happening? Part of the training loop is about adjusting the weights in the network; when we froze the weights of the embedding, it appears that we froze them at values that did not perform very well. The pre-trained GloVe embedding (Pennington, Socher, and Manning 2014) we are using has been trained on a Wikipedia dump and on Gigaword 5, a comprehensive archive of newswire text. The text on Wikipedia and in news articles follows particular styles and semantics: both tend to be written formally and in the past tense, and both contain longer, complete sentences. There are many more distinct features of Wikipedia text and news articles, but the important part is how similar they are to the data we are trying to model. These blurbs are very short and often lack punctuation, stop words, narrative, and tense; many of them simply try to pack in as many buzzwords as possible while keeping the sentence readable. It is not surprising that the embedding doesn't perform well in this model, since the text it was trained on is so far removed from the text it is being applied to.

Although this didn't work very well, it doesn't mean that pre-trained word embeddings are useless. They can sometimes perform very well; the important part is how well the embedding fits the data you are using. There is one more way we can use this embedding in our network: we can load it in as before but not freeze the weights. This allows the model to keep adjusting the weights to better fit the data, and the hope is that the pre-trained embedding delivers a better starting point than the randomly initialized embedding we get if we don't set the weights.

We specify a new model

dense_model_pte2 <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = max_words + 1,
    output_dim = ncol(glove6b_matrix),
    input_length = max_length
  ) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

then we set the weights with set_weights(), but this time we don't freeze them,

dense_model_pte2 %>%
  get_layer(index = 1) %>%
  set_weights(list(glove6b_matrix))

and we compile and fit the model as we did last time

dense_model_pte2 %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

dense_pte2_history <- dense_model_pte2 %>% fit(
  x = prepped_training,
  y = kickstarter_train$state,
  batch_size = 512,
  epochs = 40,
  validation_split = 0.2
)
dense_model_pte2 %>%
  evaluate(
    bake(prepped_recipe, kickstarter_test, composition = "matrix"),
    kickstarter_test$state
  )

This performs quite a bit better than when we froze the weights. However, it trains more slowly than when we didn't set the weights at all, since we had to run it for around 40 epochs before it starts to overfit.

If you have a large enough corpus in the domain you are working in, it may be worth training a word embedding yourself that better captures the structure of that domain.

8.3 Convolutional Neural Networks

The first networks we have shown in this chapter don't take advantage of sequential patterns. Text can have patterns of varying length, which can be hard for a simple dense network to pick up on and learn. Patterns can be encoded as n-grams (Section 2.2.3), but encoding them directly presents a problem, since the dimensionality of the vocabulary shoots up even if we just try to capture n = 2 and n = 3.
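To get a feel for how quickly the vocabulary grows, we can compare the number of distinct unigrams and bigrams in the training blurbs; a quick check (output not shown), assuming the tidytext package is available.

library(tidytext)

kickstarter_train %>%
  unnest_tokens(word, blurb) %>%
  summarise(distinct_unigrams = n_distinct(word))

kickstarter_train %>%
  unnest_tokens(bigram, blurb, token = "ngrams", n = 2) %>%
  summarise(distinct_bigrams = n_distinct(bigram))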

The convolutional neural network (CNN) architecture is the most complicated network architecture we have seen so far, so we will take some time to review its construction, its different features, and the hyperparameters you can tune. The goal of this section is to give you an intuition for how each aspect of a CNN affects its behavior. CNNs are well suited to picking up on spatial structure within data, which is a powerful feature for working with text, since text typically contains a good amount of local structure, especially when characters are used as the token. CNNs achieve their efficiency by using a small number of weights that scan across the input tensor, producing an output tensor that hopefully represents specific structures in the data.

CNNs can work with 1-, 2-, or 3-dimensional data, but when we apply them to text we will mostly work in 1 dimension, so the following illustrations and explanations are done in 1 dimension to closely match the use case in this book. Figure 8.4 illustrates a stereotypical CNN architecture. You start with your input sequence; this example uses characters as the token, but it could just as well be words. Then a filter slides along the sequence to produce a new, smaller sequence. This is done multiple times, typically with varying parameters for each layer, until we are left with a small tensor that we transform into our required output shape, a single value between 0 and 1 in the case of binary classification.


FIGURE 8.4: A template CNN architecture for 1-dimensional input data. A sequence of consecutive CNN layers incrementally reduces the tensor size, ending with a single value.

This figure is a slight simplification, since we technically don't feed raw characters into the network; instead we use a one-hot sequence encoding, possibly followed by a word embedding. We will now go through some of the most important concepts of CNNs.

8.3.1 Filters

A kernel is a small tensor, of the same dimensionality as the input tensor, that slides along the input. As it slides, it performs an element-wise multiplication of the values in the input tensor with its weights and then sums up the result to get a single value; sometimes an activation function is applied as well. It is these weights that are trained with gradient descent to find the best fit. In keras, the filters argument determines how many different kernels are trained in each layer. You typically start with fewer filters at the beginning of your network and then increase the number as you go along.

8.3.2 Kernel size

The most prominent hyperparameter is the kernel size, which is the size of the tensor, 1-dimensional in this case, that contains the weights. A kernel of size 5 has 5 weights. These kernels capture local information, similar to how n-grams capture local patterns. Increasing the size of the kernel decreases the size of the output tensor, as we see in Figure 8.5.


FIGURE 8.5: The kernel size affects the size of the resulting tensor. A kernel size of 3 uses the information from 3 values to calculate 1 value.

Larger kernels detect larger and less frequent patterns, whereas smaller kernels find fine-grained features. Notice how the choice of token affects how we think about kernel size. For character-level tokens, a kernel size of 5 in early layers finds patterns within words more often than patterns across words, since 5 characters aren't enough to adequately span multiple words. For word-level tokens, on the other hand, a kernel size of 5 finds patterns within parts of sentences instead. Kernel sizes are most often chosen to be odd.

8.3.3 Stride

The stride is the second big hyperparameter controlling the kernels in a CNN. The stride length determines how far the kernel moves along the sequence between each calculation. A stride length of 1 means that the kernel moves over one position at a time, giving maximal overlap between windows.


FIGURE 8.6: The stride length affects the size of the resulting tensor. When stride = 1, the window slides along one position at a time. Increasing the stride length decreases the resulting tensor by skipping windows.

In Figure 8.6 we see that when the kernel size and stride length are equal, there is no overlap. We can decrease the size of the output tensor by increasing the stride length, but be careful not to set the stride length larger than the kernel size, otherwise you will skip over some of the information.

8.3.4 Dilation

The dilation controls how the kernel is applied to the input tensor. So far we have shown examples where the dilation equals 1, meaning that the values taken from the input tensor are adjacent to each other.


FIGURE 8.7: The dilation affects the size of the resulting tensor. When dilation = 1, consecutive values are taken from the input. Increasing the dilation leaves gaps between input values and decreases the resulting tensor.

If we increase the dilation, we can see in Figure 8.7 that there will be gaps between the input values. This allows the kernel to find large spatial patterns that span many tokens, which is a useful trick for extracting features and structure from long sequences. Dilated convolutional layers, when put in succession, are able to find patterns in very long sequences.
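The interaction between kernel size, stride, and dilation can be summarized with a small helper function. conv_output_length() below is our own sketch (it is not part of keras) for a 1-dimensional convolution without padding.

conv_output_length <- function(input_length, kernel_size,
                               stride = 1, dilation = 1) {
  # A dilated kernel of size k covers dilation * (k - 1) + 1 input positions
  effective_kernel <- dilation * (kernel_size - 1) + 1
  floor((input_length - effective_kernel) / stride) + 1
}

conv_output_length(30, kernel_size = 5)               # overlapping windows
## [1] 26
conv_output_length(30, kernel_size = 5, stride = 5)   # no overlap
## [1] 6
conv_output_length(30, kernel_size = 5, dilation = 2) # gaps between inputs
## [1] 22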

8.3.5 Padding

The last hyperparameter we will talk about is padding. One of the downsides of how the kernels were used in the previous figures is how they handle the edges of the sequence. Padding is the act of putting something before and after the sequence while the convolution takes place, so that more information can be extracted from the first and last tokens in the sequence. Padding leads to larger output tensors, since we let the kernel move more.
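In keras, this behavior is controlled by the padding argument of layer_conv_1d(), which accepts "valid" (no padding, the default) or "same" (pad so that, with a stride of 1, the output keeps the input length). A minimal sketch of a single padded convolution layer, reusing max_words and max_length from above:

padded_cnn <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = max_words + 1, output_dim = 16,
    input_length = max_length
  ) %>%
  layer_conv_1d(
    filters = 16, kernel_size = 5,
    padding = "same", activation = "relu"
  )

Printing padded_cnn would show that the convolution layer keeps the sequence length at 30, compared to 30 - 5 + 1 = 26 with the default padding = "valid".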

8.3.6 Simple CNN

We will start with a fairly standard CNN specification that closely follows what we saw in Figure 8.4. We start with an embedding layer, followed by a sequence of 1-dimensional convolution layers layer_conv_1d(), followed by a global max pooling layer layer_global_max_pooling_1d(), and finally a dense layer with a sigmoid activation function to give us one value between 0 and 1 to use in our classification.

simple_cnn_model <- keras_model_sequential() %>%
  layer_embedding(
    input_dim = max_words + 1, output_dim = 16,
    input_length = max_length
  ) %>%
  layer_conv_1d(filters = 16, kernel_size = 11, activation = "relu") %>%
  layer_conv_1d(filters = 32, kernel_size = 9, activation = "relu") %>%
  layer_conv_1d(filters = 64, kernel_size = 7, activation = "relu") %>%
  layer_conv_1d(filters = 128, kernel_size = 5, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 1, activation = "sigmoid")

simple_cnn_model

We are using the same embedding layer as in the previous networks, so there is nothing new there. Then we have 4 convolutional layers, and there are a couple of things to take note of here. The model uses an increasing number of filters in each layer, doubling the number of filters from layer to layer, to make sure there are enough filters later on to capture the more global information. The kernel size starts relatively large and then slowly decreases, so this model will be able to find fairly large patterns in the data. We use a layer_global_max_pooling_1d() layer to collapse the remaining CNN output into 1 dimension, and we finish with a densely connected layer and a sigmoid activation function.

This might not end up being the best CNN configuration, but it is a good starting point. One of the challenges when working with CNNs is managing the dimensionality correctly. You have to handle the trade-off between having a small number of layers whose hyperparameters shrink the dimensions drastically, and having a larger number of layers where each output is only slightly smaller than the previous one. Networks with fewer layers can perform well and train fast, since there aren't that many weights to train, but you need to be careful to construct the layers so that they correctly capture the patterns you want.
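For the network above we can trace how the sequence dimension shrinks layer by layer (no padding, stride 1), which is a useful habit when stacking convolutions.

30 - 11 + 1  # after the first convolution layer
## [1] 20
20 - 9 + 1   # after the second
## [1] 12
12 - 7 + 1   # after the third
## [1] 6
6 - 5 + 1    # after the fourth, just before global max pooling
## [1] 2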

The compilation and fitting are the same as we have seen before.

simple_cnn_model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- simple_cnn_model %>% fit(
  x = prepped_training,
  y = kickstarter_train$state,
  batch_size = 512,
  epochs = 10,
  validation_split = 0.2
)

We are using the "adam" optimizer since it performs well for this model.

You will have to experiment to find the optimizer that works best for your specific model.

simple_cnn_model %>%
  evaluate(
    bake(prepped_recipe, kickstarter_test, composition = "matrix"),
    kickstarter_test$state
  )

8.4 Character level Convolutional Neural Network

8.5 Using different optimizers

8.6 Case Study: Vary NN specific parameters

8.7 Look at different deep learning architecture

8.8 Case Study: Applying the wrong model

Here we will demonstrate what happens when the wrong model is used. A model from the machine learning classification approach of Chapter 7 will be applied to this dataset.

8.9 Full game

All bells and whistles.

References

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP), 1532–43. http://www.aclweb.org/anthology/D14-1162.