Chapter 8 Classification

In this chapter, we will predict binary values, much like we did in Chapter 7, but we will use deep learning methods instead. We will be using a dataset of fundraising campaigns from Kickstarter.

library(tidyverse)

kickstarter <- read_csv("data/kickstarter.csv.gz")
kickstarter
## # A tibble: 269,790 x 3
##    blurb                                                        state created_at
##    <chr>                                                        <dbl> <date>    
##  1 Exploring paint and its place in a digital world.                0 2015-03-17
##  2 Mike Fassio wants a side-by-side photo of me and Hazel eati…     0 2014-07-11
##  3 I need your help to get a nice graphics tablet and Photosho…     0 2014-07-30
##  4 I want to create a Nature Photograph Series of photos of wi…     0 2015-05-08
##  5 I want to bring colour to the world in my own artistic skil…     0 2015-02-01
##  6 We start from some lovely pictures made by us and we decide…     0 2015-11-18
##  7 Help me raise money to get a drawing tablet                      0 2015-04-03
##  8 I would like to share my art with the world and to do that …     0 2014-10-15
##  9 Post Card don’t set out to simply decorate stories. Our goa…     0 2015-06-25
## 10 My name is Siu Lon Liu and I am an illustrator seeking fund…     0 2014-07-19
## # … with 269,780 more rows

We are working with fairly short texts in this dataset: fewer than a couple of hundred characters each. We can look at the distribution of blurb lengths.

kickstarter %>%
  ggplot(aes(nchar(blurb))) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Number of characters per campaign blurb",
    y = "Number of campaign blurbs"
  )

FIGURE 8.1: Distribution of character count for Kickstarter campaign blurbs

The distribution is left-skewed (its tail stretches toward shorter blurbs), which is to be expected: since you don’t have much space to make your impression, most people choose to use most of it. There is one odd thing happening in this chart, though. There is a sharp drop somewhere between 130 and 140 characters. Let us investigate to see if we can find the reason.

We can use count() to find the most common blurb length.

kickstarter %>%
  count(nchar(blurb), sort = TRUE)
## # A tibble: 151 x 2
##    `nchar(blurb)`     n
##             <int> <int>
##  1            135 26827
##  2            134 18726
##  3            133 14913
##  4            132 13559
##  5            131 11322
##  6            130 10083
##  7            129  8786
##  8            128  7874
##  9            127  7239
## 10            126  6590
## # … with 141 more rows

The most common length is 135 characters, which in and of itself doesn’t tell us much. It might be a glitch in the data collection process. Let us look with our own eyes at what happens around this cutoff point. We can use slice_sample() to draw a random sample of the data.

We start by looking at blurbs with exactly 135 characters, so that we can tell whether the blurbs were cut short at 135 characters.

set.seed(1)
kickstarter %>%
  filter(nchar(blurb) == 135) %>%
  slice_sample(n = 5) %>%
  pull(blurb)
## [1] "A science fiction/drama about a young man and woman encountering beings not of this earth. Armed with only their minds to confront this"
## [2] "No, not my virginity. That was taken by a girl named Ramona the night of my senior prom. I'm talking about my novel, THE USE OF REGRET."
## [3] "In a city where the sun has stopped rising, the music never stops. Now only a man and his guitar can free the people from the Red King."
## [4] "First Interfaith & Community FM Radio Station needs transmitter in Menifee, CA Programs online, too CLICK PHOTO ABOVE FOR OUR CAT VIDEO"
## [5] "This documentary asks if the twenty-four hour news cycle has altered people's opinions of one another. We explore unity in one another."

That doesn’t appear to be the case: all of these blurbs appear coherent, and some of them even end with a period. Let us now look at blurbs with more than 135 characters to see if these are different.

set.seed(1)
kickstarter %>%
  filter(nchar(blurb) > 135) %>%
  slice_sample(n = 5) %>%
  pull(blurb)
## [1] "This is a puzzle game for the Atari 2600. The unique thing about this is that (some) of the cartridge cases will be made out of real wood, hand carved"
## [2] "Art supplies for 10 girls on the east side of Detroit to make drawings of their neighborhood, which is also home to LOVELAND's Plymouth microhood"     
## [3] "Help us make a video for 'Never', one of the most popular songs on Songs To Wear Pants To and the lead single from Your Heart's upcoming album Autumn."
## [4] "Pyramid Cocoon is an interactive sculpture to be installed during the Burning Man Festival 2010. Users can rest, commune or cocoon in the piece"       
## [5] "Back us to own, wear, or see a show of great student art we've collected from Artloop partner schools in NYC. The $ goes right back to art programs!"

All of these blurbs also look good, so it doesn’t look like a data collection issue. The kickstarter dataset also includes a created_at variable. Let us see what we can learn from it.

Below is a heatmap of the lengths of blurbs and the time the campaign was posted.

kickstarter %>%
  ggplot(aes(created_at, nchar(blurb))) +
  geom_bin2d() +
  labs(
    x = NULL,
    y = "Number of characters per campaign blurb"
  )

FIGURE 8.2: Distribution of character count for Kickstarter campaign blurbs over time

We see a trend right here. It appears that at the end of 2010 there was a policy change that shortened the maximum blurb length from 150 characters to 135 characters.

kickstarter %>%
  filter(nchar(blurb) > 135) %>%
  summarise(max(created_at))
## # A tibble: 1 x 1
##   `max(created_at)`
##   <date>           
## 1 2010-10-20

We can’t tell for sure whether the change happened on 2010-10-20, but that is the last day a campaign was launched with more than 135 characters.
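
One way to double-check the apparent cutoff is to track the longest blurb per year. The sketch below does this on a small made-up tibble; the toy object and all of its values are hypothetical stand-ins, not rows from the Kickstarter data. On the real data, the same pipeline could presumably be applied to kickstarter directly.

```r
library(dplyr)

# Hypothetical toy data standing in for `kickstarter` (values are made up)
toy <- tibble(
  blurb = c(strrep("a", 150), strrep("b", 140), strrep("c", 135), strrep("d", 120)),
  created_at = as.Date(c("2009-06-01", "2010-10-20", "2011-02-01", "2012-07-15"))
)

# Longest blurb observed in each calendar year
max_by_year <- toy %>%
  mutate(year = as.integer(format(created_at, "%Y"))) %>%
  group_by(year) %>%
  summarise(max_length = max(nchar(blurb)))

max_by_year
```

If a 135-character limit was enforced from late 2010 onward, the yearly maximum should never exceed 135 in any later year.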

8.1 A first classification model

Much like all our previous modeling, our first step is to split our data into training and testing sets. We will still use our training set to build models and save the testing set for a final estimate of how our model will perform on new data. It is very easy to overfit deep learning models, so an unbiased estimate of future performance from a test set is more important than ever. This data will be hard to work with since each blurb gives us so little information.

We use initial_split() to define the training/testing split. We will focus on modeling the blurb alone in this chapter. We will restrict the data to blurbs with at least 15 characters, since shorter blurbs tend to be uninformative single words.

library(tidymodels)
set.seed(1234)
kickstarter_split <- kickstarter %>%
  filter(nchar(blurb) >= 15) %>%
  initial_split()

kickstarter_train <- training(kickstarter_split)
kickstarter_test <- testing(kickstarter_split)

There are 202,093 blurbs in the training set and 67,364 in the testing set.
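
These counts reflect the default behavior of initial_split(), which places three quarters of the rows in the training set (prop = 3/4): 202,093 of 269,457 rows is about 75%. The sketch below illustrates this on a hypothetical data frame of 1,000 rows; the toy object is an assumption made for illustration and is not part of the Kickstarter data.

```r
library(rsample)

set.seed(1234)

# Hypothetical stand-in for the filtered kickstarter data
toy <- data.frame(id = 1:1000)

# With no `prop` argument, 3/4 of the rows go to the training set
toy_split <- initial_split(toy)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)

# Every row lands in exactly one of the two sets
length(intersect(toy_train$id, toy_test$id))
```

A different proportion can be requested explicitly, for example initial_split(toy, prop = 0.8).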

8.1.1 Look at the data

8.1.2 Modeling

8.1.3 Evaluation

8.2 Preprocessing

Preprocessing is mostly the same as before: we still want to end up with all numeric features. As an example, we will use keras/tensorflow to do the preprocessing.
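
To make "all numerics" concrete, here is a minimal base R sketch of the general idea: map each word to an integer index, then pad or truncate every blurb to a fixed length. This is not the keras/tensorflow preprocessing itself, only an illustration of the kind of output such preprocessing produces. The example blurbs, the whitespace tokenizer, the to_padded() helper, and the choice of max_length are all assumptions made for this sketch.

```r
# Hypothetical example blurbs, not taken from the Kickstarter data
blurbs <- c(
  "Help me raise money for a drawing tablet",
  "Help us make a video",
  "A puzzle game"
)

# Lowercase and split on whitespace (a deliberately simple tokenizer)
tokens <- strsplit(tolower(blurbs), "\\s+")

# Build a vocabulary in order of first appearance; index 0 is reserved for padding
vocab <- unique(unlist(tokens))
word_index <- setNames(seq_along(vocab), vocab)

# Convert words to indices, keep at most `max_length` of them,
# and pad with zeros on the left so every row has the same length
max_length <- 6
to_padded <- function(words, max_length) {
  ids <- unname(word_index[words])
  ids <- head(ids, max_length)
  c(rep(0L, max_length - length(ids)), ids)
}

# One row per blurb, one column per position
sequences <- t(vapply(tokens, to_padded, integer(max_length), max_length = max_length))
sequences
```

The result is an integer matrix with one row per blurb, which is the shape of input an embedding layer typically expects.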

8.3 Putting your layers together

8.4 Use embedding

8.5 Model tuning

8.6 Case Study: Vary NN specific parameters

8.7 Look at different deep learning architecture

8.8 Case Study: Applying the wrong model

Here we will demonstrate what happens when the wrong model is used. The model from the ML classification chapter will be used on this dataset.

8.9 Full game

All bells and whistles.