# Chapter 7 Classification

In Chapter 6, we focused on modeling to predict continuous values for documents, such as what year a Supreme Court opinion was published. This is an example of a regression model. We can also use machine learning to predict labels on documents using a classification model. For example, let’s consider the dataset of consumer complaints submitted to the US Consumer Finance Protection Bureau. Let’s read in the complaint data (Appendix 12.4) with read_csv().

library(tidyverse)
complaints <- read_csv("data/complaints.csv.gz")

We can start by taking a quick glimpse() at the data to see what we have to work with. This dataset contains a text field with the complaint, along with information regarding what it was for, how and when it was filed, and the response from the bureau.

glimpse(complaints)
## Rows: 117,214
## Columns: 18
## $ date_received                <date> 2019-09-24, 2019-10-25, 2019-11-08, 201…
## $ product                      <chr> "Debt collection", "Credit reporting, cr…
## $ sub_product                  <chr> "I do not know", "Credit reporting", "I …
## $ issue                        <chr> "Attempts to collect debt not owed", "In…
## $ sub_issue                    <chr> "Debt is not yours", "Information belong…
## $ consumer_complaint_narrative <chr> "transworld systems inc. \nis trying to …
## $ company_public_response      <chr> NA, "Company has responded to the consum…
## $ company                      <chr> "TRANSWORLD SYSTEMS INC", "TRANSUNION IN…
## $ state                        <chr> "FL", "CA", "NC", "RI", "FL", "TX", "SC"…
## $ zip_code                     <chr> "335XX", "937XX", "275XX", "029XX", "333…
## $ tags                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ consumer_consent_provided    <chr> "Consent provided", "Consent provided", …
## $ submitted_via                <chr> "Web", "Web", "Web", "Web", "Web", "Web"…
## $ date_sent_to_company         <date> 2019-09-24, 2019-10-25, 2019-11-08, 201…
## $ company_response_to_consumer <chr> "Closed with explanation", "Closed with …
## $ timely_response              <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
## $ consumer_disputed            <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"…
## $ complaint_id                 <dbl> 3384392, 3417821, 3433198, 3366475, 3385…

In this chapter, we will build classification models to predict what type of financial product the complaints are referring to, i.e., a label or categorical variable.

## 7.1 A first classification model

For our first model, let’s build a binary classification model to predict whether a submitted complaint is about “Credit reporting, credit repair services, or other personal consumer reports” or not.

This kind of “yes or no” binary classification model is both common and useful in real-world text machine learning problems.

The outcome variable product contains more categories than this, so we need to transform this variable to contain only the values “Credit reporting, credit repair services, or other personal consumer reports” and “Other”.

It is always a good idea to look at your data! Here are the first six complaints:

head(complaints$consumer_complaint_narrative)
## [1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."
## [2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."
## [3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."
## [4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"
## [5] "While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. \nI then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed to them or any party for that matter, and I feel discouraged that the credit bureaus who control many aspects of my personal finances are so negligent about my information."
## [6] "I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX"

The complaint narratives contain many series of capital "X"’s. These strings (like “XX/XX” or “XXXX XXXX XXXX XXXX”) are used to protect personally identifiable information (PII) in this publicly available dataset. This is not a universal censoring mechanism; censoring and PII protection will vary from source to source. Hopefully you will be able to find information on PII censoring in a data dictionary, but you should always look at the data yourself to verify. We also see that monetary amounts are surrounded by curly brackets (like "{$21.00}"); this is another text preprocessing step that has been taken care of for us. We could craft a regular expression to extract all the dollar amounts.

complaints$consumer_complaint_narrative %>%
  str_extract_all("\\{\\$[0-9\\.]*\\}") %>%
  compact() %>%
  head()
## [[1]]
## [1] "{$21.00}" "{$1.00}"
##
## [[2]]
## [1] "{$2300.00}"
##
## [[3]]
## [1] "{$200.00}"  "{$5000.00}" "{$5000.00}" "{$770.00}"  "{$800.00}"
## [6] "{$5000.00}"
##
## [[4]]
## [1] "{$15000.00}" "{$11000.00}" "{$420.00}"   "{$15000.00}"
##
## [[5]]
## [1] "{$0.00}" "{$0.00}" "{$0.00}" "{$0.00}"
##
## [[6]]
## [1] "{$650.00}"

In Section 7.7, we will use an approach like this for custom feature engineering from the text.
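As an aside, the extracted strings can also be converted to numeric values. Here is a small base-R sketch (the snippet text and variable names are invented for illustration), using gregexpr() and regmatches() in place of str_extract_all():

```r
# A base-R sketch (not from the modeling pipeline): pull bracketed dollar
# amounts out of a complaint-style snippet and convert them to numbers.
snippet <- "it was for the amount of {$21.00} including a {$1.00} fee"

# Same pattern as above, with base R's gregexpr()/regmatches()
amounts <- regmatches(snippet, gregexpr("\\{\\$[0-9.]*\\}", snippet))[[1]]
amounts
# "{$21.00}" "{$1.00}"

# Strip the "{$" and "}" wrappers and parse the remainder as numeric
as.numeric(gsub("[{}$]", "", amounts))
# 21 1
```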

### 7.1.1 Building our first classification model

This dataset includes more possible predictors than the text alone, but for this first model we will only use the text variable consumer_complaint_narrative. Let’s create a factor outcome variable product with two levels, “Credit” and “Other”. Then, we split the data into training and testing datasets. We can use the initial_split() function from rsample to create this binary split of the data. The strata argument ensures that the distribution of product is similar in the training set and testing set. Since the split uses random sampling, we set a seed so we can reproduce our results.

library(tidymodels)

set.seed(1234)
complaints2class <- complaints %>%
  mutate(product = factor(if_else(
    product == "Credit reporting, credit repair services, or other personal consumer reports",
    "Credit", "Other"
  )))

complaints_split <- initial_split(complaints2class, strata = product)

complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)

The dimensions of the two splits show that this first step worked as we planned.

dim(complaints_train)
## [1] 87911    18
dim(complaints_test)
## [1] 29303    18
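
To see what the strata argument buys us, here is a toy base-R sketch (invented labels, not the complaints data): sampling within each class separately keeps the class proportions in both splits identical to the full data.

```r
# Toy illustration of stratified sampling: a 60/40 class balance is preserved
set.seed(123)
labels <- rep(c("Credit", "Other"), times = c(600, 400))

# Take 75% within each class separately
idx <- unlist(lapply(split(seq_along(labels), labels),
                     function(i) sample(i, size = 0.75 * length(i))))
train <- labels[idx]
test  <- labels[-idx]

prop.table(table(train))
# Credit  Other
#    0.6    0.4
prop.table(table(test))
# Credit  Other
#    0.6    0.4
```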

Next we need to preprocess this data to prepare it for modeling; we have text data, and we need to build numeric features for machine learning from that text.

The recipes package, part of tidymodels, allows us to create a specification of preprocessing steps we want to perform. These transformations are estimated (or “trained”) on the training set so that they can be applied in the same way on the testing set or new data at prediction time, without data leakage. We initialize our set of preprocessing transformations with the recipe() function, using a formula expression to specify the variables, our outcome plus our predictor, along with the dataset.

complaints_rec <-
  recipe(product ~ consumer_complaint_narrative, data = complaints_train)

Now we add steps to process the text of the complaints; we use textrecipes to handle the consumer_complaint_narrative variable. First we tokenize the text to words with step_tokenize(). By default this uses tokenizers::tokenize_words(). Next we remove stop words with step_stopwords(); the default choice is the Snowball stop word list, but custom lists can be provided too. Before we calculate tf-idf we use step_tokenfilter() to only keep the 500 most frequent tokens, to avoid creating too many variables in our first model. To finish, we use step_tfidf() to compute tf-idf.
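
Before fitting anything, it can help to see what tf-idf computes. Below is a minimal base-R sketch of one common formulation (term frequency times log inverse document frequency); the exact weighting used by step_tfidf() is configurable and may differ in its details.

```r
# Tiny toy corpus: three "documents" as word vectors
docs <- list(c("debt", "collection", "debt"),
             c("credit", "report"),
             c("credit", "card"))

# tf-idf for one term in one document:
# (term count / doc length) * log(n docs / n docs containing the term)
tf_idf <- function(term, doc, docs) {
  tf  <- sum(doc == term) / length(doc)
  idf <- log(length(docs) / sum(vapply(docs, function(d) term %in% d, logical(1))))
  tf * idf
}

tf_idf("debt", docs[[1]], docs)    # (2/3) * log(3/1), approx. 0.732
tf_idf("credit", docs[[2]], docs)  # (1/2) * log(3/2), approx. 0.203
```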

library(textrecipes)

complaints_rec <- complaints_rec %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_stopwords(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 500) %>%
  step_tfidf(consumer_complaint_narrative)

Now that we have a full specification of the preprocessing recipe, we can prep() this recipe to estimate all the necessary parameters for each step using the training data.

complaint_prep <- prep(complaints_rec)

For most modeling tasks, you will not need to prep() your recipe directly; instead you can build up a tidymodels workflow() to bundle together your modeling components.

complaint_wf <- workflow() %>%
  add_recipe(complaints_rec)

Let’s start with a naive Bayes model (Sang-Bum Kim et al. 2006; Kibriya et al. 2005; Frank and Bouckaert 2006), which is available in the tidymodels package discrim. One of the main advantages of a naive Bayes model is its ability to handle a large number of features, such as those we deal with when using word count methods. Here we have only kept the 500 most frequent tokens, but we could have kept more tokens and a naive Bayes model would still be able to handle such predictors well. For now, we will limit the model to a moderate number of tokens.

library(discrim)
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

nb_spec
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes

Now we have everything we need to fit our first classification model. We can add the naive Bayes model to our workflow, and then we can fit this workflow to our training data.

nb_fit <- complaint_wf %>%
  add_model(nb_spec) %>%
  fit(data = complaints_train)

We have trained our first classification model!

### 7.1.2 Evaluation

Like we discussed in Section 6.1.2, we should not use the test set to compare models or different model parameters. The test set is a precious resource that should only be used at the end of the model training process to estimate performance on new data. Instead, we will use resampling methods to evaluate our model.

Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit. We can do this using resampled datasets built from the training set. Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.

set.seed(234)
complaints_folds <- vfold_cv(complaints_train)

complaints_folds
## #  10-fold cross-validation
## # A tibble: 10 x 2
##    splits               id
##    <list>               <chr>
##  1 <split [79.1K/8.8K]> Fold01
##  2 <split [79.1K/8.8K]> Fold02
##  3 <split [79.1K/8.8K]> Fold03
##  4 <split [79.1K/8.8K]> Fold04
##  5 <split [79.1K/8.8K]> Fold05
##  6 <split [79.1K/8.8K]> Fold06
##  7 <split [79.1K/8.8K]> Fold07
##  8 <split [79.1K/8.8K]> Fold08
##  9 <split [79.1K/8.8K]> Fold09
## 10 <split [79.1K/8.8K]> Fold10

Each of these splits contains information about how to create cross-validation folds from the original training data. In this example, 90% of the training data is included in each fold and the other 10% is held out for evaluation.
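
The fold bookkeeping itself is simple. Here is a base-R sketch (a toy size of 100 rather than the actual training set) of how each observation is assigned to exactly one assessment fold:

```r
# Assign each of n observations to one of 10 folds, in random order
n <- 100
set.seed(1)
fold_id <- sample(rep(1:10, length.out = n))

# For fold 1: fit on the other nine folds, evaluate on the held-out tenth
analysis1   <- which(fold_id != 1)
assessment1 <- which(fold_id == 1)
length(analysis1)    # 90
length(assessment1)  # 10
```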

For convenience, let’s again use a workflow() for our resampling estimates of performance.

Using a workflow() isn’t required (you can fit or tune a model together with a preprocessor on its own), but it can make your code easier to read and organize.

nb_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(nb_spec)

nb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes

In the last section, we fit one time to the training data as a whole. Now, to estimate how well that model performs, let’s fit the model many times, once to each of these resampled folds, and then evaluate on the heldout part of each resampled fold.

nb_rs <- fit_resamples(
  nb_wf,
  complaints_folds,
  control = control_resamples(save_pred = TRUE)
)

We can extract the relevant information using collect_metrics() and collect_predictions().

nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)

What results do we see, in terms of performance metrics?

nb_rs_metrics
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.728    10 0.00339
## 2 roc_auc  binary     0.890    10 0.00151

The default performance metrics for binary classification are accuracy and ROC AUC (area under the receiver operator curve). Accuracy is the proportion of predictions that are correct. For these resamples, the average accuracy is 72.8%.

For both accuracy and ROC AUC, values closer to 1 are better.

The receiver operator curve plots the sensitivity against one minus the specificity at different classification thresholds. It demonstrates how well a classification model can distinguish between classes. Figure 7.1 shows the ROC curve for our first classification model on each of the resampled datasets.

nb_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = product, .pred_Credit) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "Receiver operator curve for US Consumer Finance Complaints",
    subtitle = "Each resample fold is shown in a different color"
  )

The area under each of these curves is the roc_auc metric we have computed. If a curve were close to the diagonal line, then the model’s predictions would be no better than random guessing.
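
The quantities behind these curves can be computed directly. Here is a base-R sketch on invented toy predictions: at each threshold we compute the true positive rate (sensitivity) and the false positive rate (one minus specificity).

```r
# Toy truth labels and predicted class probabilities (invented for illustration)
truth <- c(1, 1, 1, 0, 0)            # 1 = "Credit"
prob  <- c(0.9, 0.8, 0.4, 0.7, 0.2)  # predicted probability of "Credit"

# One point on the ROC curve for a given probability threshold
roc_point <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  c(tpr = sum(pred == 1 & truth == 1) / sum(truth == 1),
    fpr = sum(pred == 1 & truth == 0) / sum(truth == 0))
}

roc_point(0.5)
# tpr = 2/3 (two of three positives caught), fpr = 1/2 (one false alarm)
```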

Another way to evaluate our model is to examine the confusion matrix. A confusion matrix tabulates a model’s false positives and false negatives for each class. There is not a trivial way to visualize multiple confusion matrices, so we can look at them individually for a single fold.

nb_rs_predictions %>%
  filter(id == "Fold01") %>%
  conf_mat(product, .pred_class) %>%
  autoplot(type = "heatmap")

In Figure 7.2, the diagonal squares have darker shades than the off-diagonal squares. This is a good sign, meaning that our model is right more often than not. However, this first model is struggling somewhat, since it is close to even odds when predicting something from the “Other” class.

One metric alone cannot give you a complete picture of how well your classification model is performing. The confusion matrix is a good starting point to get an overview of your model performance as it includes rich information.
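
The confusion matrix itself is just a cross-tabulation of predictions against truth. A base-R sketch with invented toy labels:

```r
# Toy truth and predictions (invented for illustration)
truth <- c("Credit", "Credit", "Other", "Other", "Credit", "Other")
pred  <- c("Credit", "Other",  "Other", "Credit", "Credit", "Other")

# What conf_mat() tabulates: predicted class against true class
cm <- table(Prediction = pred, Truth = truth)
cm
#           Truth
# Prediction Credit Other
#     Credit      2     1
#     Other       1     2

# Accuracy is the proportion on the diagonal
sum(diag(cm)) / sum(cm)
# 4/6, approx. 0.667
```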

This is real data from a government agency, and these kinds of performance metrics must be interpreted in the context of how such a model would be used. What happens if the model we trained gets a classification wrong for a consumer complaint? What impact will it have if more “Credit” complaints are correctly identified than “Other” complaints, either for consumers or for policymakers?

## 7.2 Compare to the null model

Like we did in Section 6.2, we can assess a model like this one by comparing its performance to a “null model” or baseline model, a simple, non-informative model that always predicts the largest class for classification. Such a model is perhaps the simplest heuristic or rule-based alternative that we can consider as we assess our modeling efforts.
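
The accuracy of such a baseline is simply the share of the majority class. A base-R sketch, with toy labels chosen to mirror the rough class balance in our training data:

```r
# Toy labels: about 52.6% "Other", 47.4% "Credit" (invented, per-thousand scale)
truth <- c(rep("Other", 526), rep("Credit", 474))

# The null model always predicts the most common class
majority <- names(which.max(table(truth)))
baseline_acc <- mean(truth == majority)
majority       # "Other"
baseline_acc   # 0.526
```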

We can build a classification null_model() specification and add it to a workflow() with the same preprocessing recipe we used in the previous section, to estimate performance.

null_classification <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification")

null_rs <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(null_classification) %>%
  fit_resamples(
    complaints_folds
  )

What results do we obtain from the null model, in terms of performance metrics?

null_rs %>%
  collect_metrics()
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.526    10 0.00149 Preprocessor1_Model1
## 2 roc_auc  binary     0.5      10 0       Preprocessor1_Model1

The accuracy and ROC AUC indicate that this null model is, like in the regression case, dramatically worse than even our first model. The text of the CFPB complaints is predictive of the category we are building models for.

## 7.3 Compare to an SVM model

Support vector machines are a class of machine learning models that can be used in regression and classification tasks. While they don’t see widespread use in cutting-edge machine learning research today, they are frequently used in practice, have properties that make them well-suited for text classification (Joachims 1998), and can give good performance (Van-Tu and Anh-Cuong 2016).

Let’s create a specification of an SVM model with a radial basis function as the kernel, a good default for SVMs.

svm_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("liquidSVM")

svm_spec
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Computational engine: liquidSVM

Then we can create another workflow() object with the SVM specification. Notice that we can reuse our text preprocessing recipe.

svm_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(svm_spec)

svm_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Computational engine: liquidSVM

The liquidSVM engine doesn’t support class probabilities as output so we need to replace the default metric set with a metric set that doesn’t use class probabilities. Here we use accuracy, sensitivity, and specificity.

set.seed(2020)
svm_rs <- fit_resamples(
  svm_wf,
  complaints_folds,
  metrics = metric_set(accuracy, sensitivity, specificity),
  control = control_resamples(save_pred = TRUE)
)

Let’s again extract the relevant information using collect_metrics() and collect_predictions().

svm_rs_metrics <- collect_metrics(svm_rs)
svm_rs_predictions <- collect_predictions(svm_rs)

Now we can see that svm_rs_metrics contains the three performance metrics we chose for the SVM model.

svm_rs_metrics
## # A tibble: 3 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.862    10 0.00132
## 2 sens     binary     0.865    10 0.00227
## 3 spec     binary     0.859    10 0.00265

This looks pretty promising, considering we didn’t do any hyperparameter tuning on the model parameters. Let’s finish this section by generating a confusion matrix, shown in Figure 7.3. Our SVM model is much better at separating the classes than the naive Bayes model in Section 7.1.1, and our results are much more symmetrical than those for the naive Bayes model in Figure 7.2.

svm_rs_predictions %>%
  filter(id == "Fold01") %>%
  conf_mat(product, .pred_class) %>%
  autoplot(type = "heatmap")

One of the main benefits of a support vector machine model is its support for sparse data, making this algorithm a great match for text data. Support vector machine models generally also perform well with many predictors, which are again quite characteristic of text data.

## 7.4 Two class or multiclass?

Most of this chapter focuses on binary classification, where we have two classes in our outcome variable (such as “Credit” and “Other”) and each observation can be either one or the other. This is a simple scenario with straightforward evaluation strategies, because the results can be summarized in a two-by-two contingency table. However, it is not always possible to limit a modeling question to two classes. Let’s explore how to deal with situations where we have more than two classes. The CFPB complaints dataset we have been working with has nine different product classes. In decreasing frequency, they are:

• Credit reporting, credit repair services, or other personal consumer reports
• Debt collection
• Credit card or prepaid card
• Mortgage
• Checking or savings account
• Student loan
• Vehicle loan or lease
• Money transfer, virtual currency, or money service
• Payday loan, title loan, or personal loan

We assume that there is a reason why these product classes have been created in this fashion by this government agency. Perhaps complaints from different classes are handled by different people or organizations. Whatever the reason, in this section we would like to build a multiclass classifier to identify these nine specific product classes.

We need to create a new split of the data using initial_split() on the unmodified complaints dataset.

set.seed(1234)

multicomplaints_split <- initial_split(complaints, strata = product)

multicomplaints_train <- training(multicomplaints_split)
multicomplaints_test <- testing(multicomplaints_split)

Before we continue, let us take a look at the number of cases in each of the classes.

multicomplaints_train %>%
  count(product, sort = TRUE) %>%
  select(n, product)
## # A tibble: 9 x 2
##       n product
##   <int> <chr>
## 1 41724 Credit reporting, credit repair services, or other personal consumer re…
## 2 16688 Debt collection
## 3  8648 Credit card or prepaid card
## 4  7111 Mortgage
## 5  5145 Checking or savings account
## 6  2930 Student loan
## 7  2049 Vehicle loan or lease
## 8  1938 Money transfer, virtual currency, or money service
## 9  1678 Payday loan, title loan, or personal loan

There is significant imbalance between the classes that we must address, with over twenty times more cases of the majority class than of the smallest class. This kind of imbalance is a common problem with multiclass classification, with few multiclass datasets in the real world exhibiting balance between classes.

Compared to binary classification, there are several additional issues to keep in mind when working with multiclass classification:

• Many machine learning algorithms do not handle imbalanced data well and are likely to have a hard time predicting minority classes.
• Not all machine learning algorithms are built for multiclass classification at all.
• Many evaluation metrics need to be reformulated to describe multiclass predictions.

When you have multiple classes in your data, it is possible to formulate the multiclass problem in two ways. With one approach, any given observation can belong to multiple classes. With the other approach, an observation can belong to one and only one class. We will be sticking to the second, “one class per observation” model formulation in this section.

There are many different ways to deal with imbalanced data. We will demonstrate one of the simplest methods, downsampling, where observations from the majority classes are removed during training to achieve a balanced class distribution. We will be using the themis add-on package for recipes which provides the step_downsample() function to perform downsampling.

The themis package provides many more algorithms to deal with imbalanced data.
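
The core idea of downsampling is easy to sketch in base R (toy labels here, not the CFPB classes): keep only as many rows of each class as the smallest class has.

```r
# Toy imbalanced labels (invented counts)
set.seed(42)
labels <- rep(c("Credit", "Debt", "Mortgage"), times = c(500, 200, 100))

# Keep a random sample of size min(class sizes) from each class
smallest <- min(table(labels))
keep <- unlist(lapply(split(seq_along(labels), labels),
                      function(i) sample(i, smallest)))

table(labels[keep])
# Credit, Debt, and Mortgage: 100 each
```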

We have to create a new recipe specification from scratch, since we are dealing with new training data this time. The specification multicomplaints_rec is similar to what we created in Section 7.1. The only changes are that different data is passed to the data argument in the recipe() function (it is now multicomplaints_train) and we have added step_downsample(product) to the end of the recipe specification to downsample after all the text preprocessing. We want to downsample last so that we still generate features on the full training dataset. The downsampling will then only affect the modeling step, not the preprocessing steps, with hopefully better results.

library(themis)

multicomplaints_rec <-
  recipe(product ~ consumer_complaint_narrative,
         data = multicomplaints_train
  ) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_stopwords(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 500) %>%
  step_tfidf(consumer_complaint_narrative) %>%
  step_downsample(product)

We also need a new cross-validation object since we are using a different dataset.

multicomplaints_folds <- vfold_cv(multicomplaints_train)

We can reuse the support vector machine specification from Section 7.3 to create a new workflow object with the new recipe specification. The SVM algorithm is specified for binary classification, but extensions have been made to generalize it to multiclass cases. The liquidSVM method will automatically detect that we are performing multiclass classification and switch to the appropriate case.

multi_svm_wf <- workflow() %>%
  add_recipe(multicomplaints_rec) %>%
  add_model(svm_spec)

multi_svm_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_downsample()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Computational engine: liquidSVM

Notice that you don’t have to specify anything to perform multiclass classification in this case. The modeling packages will infer this from the number of classes in the outcome variable.

Now we have everything we need for fit_resamples() to fit all the models. Note that we specify save_pred = TRUE, so we can create a confusion matrix later. This is especially beneficial for multiclass classification. We again specify the metric_set() since liquidSVM doesn’t support class probabilities.

multi_svm_rs <- fit_resamples(
  multi_svm_wf,
  multicomplaints_folds,
  metrics = metric_set(accuracy),
  control = control_resamples(save_pred = TRUE)
)

multi_svm_rs

Let’s again extract the relevant results using collect_metrics() and collect_predictions().

multi_svm_rs_metrics <- collect_metrics(multi_svm_rs)
multi_svm_rs_predictions <- collect_predictions(multi_svm_rs)

What do we see, in terms of performance metrics?

multi_svm_rs_metrics
## # A tibble: 1 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.692    10 0.00354

The accuracy metric naturally extends to multiclass tasks, but it appears quite low at 69.2%, significantly lower than for the binary case in Section 7.3. This is expected since multiclass classification is a harder task than binary classification. In binary classification, there is one right answer and one wrong answer; in this case, there is one right answer and eight wrong answers.

To get a more detailed view of how our classifier is performing, let us look at one of the confusion matrices in Figure 7.4.

multi_svm_rs_predictions %>%
  filter(id == "Fold01") %>%
  conf_mat(product, .pred_class) %>%
  autoplot(type = "heatmap") +
  scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 20))

The diagonal is fairly well populated, which is a good sign; the model generally predicted the right class. The off-diagonal numbers are all the failures, and that is where we should direct our focus. It is a little hard to see these cases well, since the majority class affects the scale. A trick to deal with this problem is to remove all the correctly predicted observations.

multi_svm_rs_predictions %>%
  filter(id == "Fold01") %>%
  filter(.pred_class != product) %>%
  conf_mat(product, .pred_class) %>%
  autoplot(type = "heatmap") +
  scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 20))

Now we can more clearly see where our model breaks down in Figure 7.5. One of the most common errors is “Credit reporting, credit repair services, or other personal consumer reports” complaints being wrongly predicted as “Credit card or prepaid card” complaints. That is not hard to understand, since both deal with credit and have overlapping vocabulary. Knowing what the problem is helps us figure out how to improve our model. The first step for improving our model is to revisit the data preprocessing steps and model selection. We can look at different models or model engines that might be able to more easily separate the classes. The svm_rbf() model has a cost argument that determines the penalization of wrongly predicted classes, which might be worth looking at.

Now that we have an idea of where the model isn’t working, we can look more closely at the data to create features that could distinguish between these classes. In Section 7.7 we will demonstrate how you can create custom features.

## 7.5 Case study: including non-text data

We are building a model from a dataset that includes more than text data alone. Annotations and labels have been added by the CFPB that we can use during modeling, but we need to ensure that only information that would be available at the time of prediction is included in the model. Otherwise we will be very disappointed once our model is used to predict on new data! The variables we identify as available for use as predictors are:

• date_received
• issue
• sub_issue
• consumer_complaint_narrative
• company
• state
• zip_code
• tags
• submitted_via

Let’s try including date_received in our modeling, along with the text variable consumer_complaint_narrative and tags. The submitted_via variable could have been a viable candidate, but all the entries are “web”. The other variables like ZIP code could be of use too, but they are categorical variables with many values so we will exclude them for now.

more_vars_rec <-
  recipe(product ~ date_received + tags + consumer_complaint_narrative,
         data = complaints_train
  )

How should we preprocess the date_received variable? We can use the step_date() function to extract the month and day of the week ("dow"). Then we remove the original date variable and convert the new month and day-of-the-week columns to indicator variables with step_dummy().

Categorical variables like the month can be stored as strings or factors, but for some kinds of models, they must be converted to indicator or dummy variables. These are numeric binary variables for the levels of the original categorical variable. For example, a variable called December would be created that is all zeroes and ones specifying which complaints were submitted in December, plus a variable called November, a variable called October, and so on.
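
Base R can illustrate this encoding with model.matrix(), which creates one 0/1 column per factor level (minus a reference level); step_dummy() produces an analogous expansion. The toy months below are invented for illustration.

```r
# Four toy submission months as a factor
month <- factor(c("Oct", "Nov", "Dec", "Nov"), levels = c("Oct", "Nov", "Dec"))

# One indicator column per level, with "Oct" as the reference level
mm <- model.matrix(~ month)
mm
#   (Intercept) monthNov monthDec
# 1           1        0        0
# 2           1        1        0
# 3           1        0        1
# 4           1        1        0
```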

more_vars_rec <- more_vars_rec %>%
step_date(date_received, features = c("month", "dow"), role = "dates") %>%
step_rm(date_received) %>%
step_dummy(has_role("dates"))

The tags variable has some missing data. We can deal with this by using step_unknown(), which adds a new level to this factor variable for cases of missing data. Then we “dummify” (create dummy/indicator variables) the variable with step_dummy().

more_vars_rec <- more_vars_rec %>%
step_unknown(tags) %>%
step_dummy(tags)

Now we add steps to process the text of the complaints, as before.

more_vars_rec <- more_vars_rec %>%
step_tokenize(consumer_complaint_narrative) %>%
step_stopwords(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 500) %>%
step_tfidf(consumer_complaint_narrative)

Let’s combine this more extensive preprocessing recipe that handles more variables together with the support vector machine model specification.

more_vars_wf <- workflow() %>%
add_recipe(more_vars_rec) %>%
add_model(svm_spec)

more_vars_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 9 Recipe Steps
##
## ● step_date()
## ● step_rm()
## ● step_dummy()
## ● step_unknown()
## ● step_dummy()
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Computational engine: liquidSVM

Let’s fit this workflow() to our resampled datasets and estimate accuracy, sensitivity, and specificity.

set.seed(123)
more_vars_rs <- fit_resamples(
more_vars_wf,
complaints_folds,
metrics = metric_set(accuracy, sensitivity, specificity)
)

We can extract the metrics from these results with collect_metrics().

more_vars_metrics <- collect_metrics(more_vars_rs)

How did these three performance metrics turn out for our model that included more than just the text data?

more_vars_metrics
## # A tibble: 3 x 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.862    10 0.00124 Preprocessor1_Model1
## 2 sens     binary     0.865    10 0.00120 Preprocessor1_Model1
## 3 spec     binary     0.859    10 0.00227 Preprocessor1_Model1

With only text features in Section 7.3, we achieved an accuracy of 0.862; including the features dealing with dates and tags as well, our accuracy remains 0.862. In this case, the additional predictors do not meaningfully change model performance.

This whole book focuses on supervised machine learning for text data, but models can combine both text predictors and other kinds of predictors.

## 7.6 Case study: data censoring

The complaints dataset already has sensitive information (PII) censored or protected using strings such as “XXXX” and “XX”. This data censoring can be viewed as data annotation; specific account numbers and birthdays are protected but we know they were there. These values would be mostly unique anyway, and likely filtered out in their original form.

Figure 7.6 shows the most frequent trigrams (Section 2.2.3) in our training dataset.

library(tidytext)

complaints_train %>%
slice(1:1000) %>%
unnest_tokens(trigrams, consumer_complaint_narrative,
token = "ngrams",
collapse = FALSE
) %>%
count(trigrams, sort = TRUE) %>%
mutate(censored = str_detect(trigrams, "xx")) %>%
slice(1:20) %>%
ggplot(aes(n, reorder(trigrams, n), fill = censored)) +
geom_col() +
scale_fill_manual(values = c("grey40", "firebrick")) +
labs(y = "Trigrams", x = "Count")

The vast majority of trigrams in Figure 7.6 include one or more censored words. Not only do the most used trigrams include some kind of censoring, but the censoring itself is informative as it is not used uniformly across the product classes. In Figure 7.7, we take the top 25 most frequent trigrams that include censoring, and plot the proportions for “Credit” and “Other”.

top_censored_trigrams <- complaints_train %>%
slice(1:1000) %>%
unnest_tokens(trigrams, consumer_complaint_narrative,
token = "ngrams",
collapse = FALSE
) %>%
count(trigrams, sort = TRUE) %>%
filter(str_detect(trigrams, "xx")) %>%
slice(1:25)

plot_data <- complaints_train %>%
unnest_tokens(trigrams, consumer_complaint_narrative,
token = "ngrams",
collapse = FALSE
) %>%
right_join(top_censored_trigrams, by = "trigrams") %>%
count(trigrams, product, .drop = FALSE)

plot_data %>%
ggplot(aes(n, trigrams, fill = product)) +
geom_col(position = "fill")

There is a difference in these proportions across classes. Tokens like “on xx xx” and “of xx xx” are used when referencing a date, e.g., “we had a problem on 06/25/2018”. Remember that the current tokenization engine strips punctuation before tokenizing. This means that the above example will be turned into “we had a problem on 06 25 2018” before creating n-grams.
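We can see this behavior directly with the tokenizers package, which supplies the default tokenization engine used here:

```r
library(tokenizers)

# Punctuation is stripped before splitting, so the date collapses into
# three separate number tokens:
# "we" "had" "a" "problem" "on" "06" "25" "2018"
tokenize_words("we had a problem on 06/25/2018")
```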

To crudely simulate what the data might look like before it was censored, we can replace all cases of “XX” and “XXXX” with random integers. This isn’t quite right since dates will be given values between 10 and 99 and we don’t know for sure that only numerals have been censored, but it gives us a place to start. Below is a simple function uncensor_vec() that locates all instances of “XX” and replaces them with a number between 10 and 99. We don’t need to handle the special case of “XXXX”, since it is matched as two adjacent instances of “XX”.

uncensor <- function(n) {
as.character(sample(seq(10^(n - 1), 10^n - 1), 1))
}

uncensor_vec <- function(x) {
locs <- str_locate_all(x, "XX")

map2_chr(x, locs, ~ {
for (i in seq_len(nrow(.y))) {
str_sub(.x, .y[i, 1], .y[i, 2]) <- uncensor(2)
}
.x
})
}

We can run a quick test to see how it works.

uncensor_vec("In XX/XX/XXXX I leased a XXXX vehicle")
## [1] "In 11/82/4458 I leased a 1169 vehicle"

Now we can produce the same visualization as Figure 7.6 but also applying our uncensoring function to the text before tokenizing.

complaints_train %>%
slice(1:1000) %>%
mutate(text = uncensor_vec(consumer_complaint_narrative)) %>%
unnest_tokens(trigrams, text,
token = "ngrams",
collapse = FALSE
) %>%
count(trigrams, sort = TRUE) %>%
mutate(censored = str_detect(trigrams, "xx")) %>%
slice(1:20) %>%
ggplot(aes(n, reorder(trigrams, n), fill = censored)) +
geom_col() +
scale_fill_manual(values = c("grey40", "firebrick")) +
labs(y = "Trigrams", x = "Count")

Here in Figure 7.8, we see many of the same trigrams that appeared in Figure 7.6, but the trigrams that contained censoring are now gone, thanks to our uncensoring function. This is expected: while “xx xx 2019” appears in the first plot, indicating a date in the year 2019, after we uncensor it the dates are split across the 365 days of the year (actually more, since we used numerical values between 10 and 99). Censoring the dates in these complaints gives more power to a date as a general construct.

What happens when we use these censored dates as a feature in supervised machine learning? We have a higher chance of understanding if dates in the complaint text are important to predicting the class, but we are blinded to the possibility that certain dates and months are more important.

Data censoring can be a form of preprocessing in your data pipeline. For example, it is highly unlikely to be useful (or ethical/legal) to have any specific person’s social security number, credit card number, or any other kind of PII embedded into your model. Such values appear rarely and are most likely highly correlated with other known variables in your dataset. More importantly, that information can become embedded in your model and begin to leak as demonstrated by Carlini et al. (2018), Fredrikson et al. (2014), and Fredrikson, Jha, and Ristenpart (2015). Both of these issues are important, and one of them could land you in a lot of legal trouble. Exposing such PII to modeling is an example of where we should all stop to ask, “Should we even be doing this?” as we discussed in the foreword to these chapters.

If you have social security numbers in text data, you should definitely not pass them on to your machine learning model, but you may consider the option of annotating the presence of a social security number. Since a social security number has a very specific form, we can easily construct a regular expression (Appendix 11) to locate them.

A social security number comes in the form AAA-BB-CCCC where AAA is a number between 001 and 899 excluding 666, BB is a number between 01 and 99 and CCCC is a number between 0001 and 9999. This gives us the following regex:

(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}

We can use a function to replace each social security number with an indicator that can be detected later by preprocessing steps. It’s a good idea to use a “word” that won’t be accidentally broken up by a tokenizer.

ssn_text <- c(
"My social security number is 498-08-6333",
"No way, mine is 362-60-9159",
"My parents numbers are 575-32-6985 and 576-36-5202"
)

ssn_pattern <- "(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}"

str_replace_all(
string = ssn_text,
pattern = ssn_pattern,
replacement = "ssnindicator"
)
## [1] "My social security number is ssnindicator"
## [2] "No way, mine is ssnindicator"
## [3] "My parents numbers are ssnindicator and ssnindicator"

This technique isn’t useful only for personally identifiable information but can be used anytime you want to gather similar words in the same bucket; hashtags, email addresses, and usernames can sometimes benefit from being annotated in this way.
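For example, we could annotate email addresses in the same way we annotated social security numbers. The pattern below is a simplified sketch, not a fully RFC-compliant email regex.

```r
library(stringr)

emails_text <- c(
  "Contact me at jane@example.com",
  "No email here"
)

# Replace anything that looks like an email address with a single
# tokenizer-safe indicator word. This simplified pattern is a sketch,
# not a complete email validator.
str_replace_all(
  string = emails_text,
  pattern = "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[A-Za-z]{2,}",
  replacement = "emailindicator"
)
## [1] "Contact me at emailindicator" "No email here"
```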

The practice of data re-identification or de-anonymization, where seemingly or partially “anonymized” datasets are mined to identify individuals, is out of scope for this section and our book. However, this is a significant and important issue for any data practitioner dealing with PII and we encourage readers to familiarize themselves with results such as Sweeney (2000), and current best practices to protect against such mining.

## 7.7 Case study: custom features

Most of what we have looked at so far has boiled down to counting tokens and weighting them in one way or another. This approach is quite broad and domain agnostic, but you as a data practitioner often have specific knowledge about your dataset that you should use in feature engineering. Your domain knowledge allows you to build more predictive features than the naive search of simple tokens. As long as you can reasonably formulate what you are trying to count, chances are you can write a function that can detect it. This is where having a little bit of regular expressions knowledge pays off.

The textfeatures package includes functions to extract useful features from text, from the number of digits to the number of second person pronouns and more. These features can be used in textrecipes data preprocessing with the step_textfeature() function.

Your specific domain knowledge may provide specific guidance about feature engineering for text. Such custom features can be simple such as the number of URLs or the number of punctuation marks. They can also be more engineered such as the percentage of capitalization, whether the text ends with a hashtag, or whether two people’s names are both mentioned in a document.

For our CFPB complaints data, certain patterns have not adequately been picked up by our model so far, such as the data censoring and the curly bracket annotation for monetary amounts that we saw in Section 7.1. Let’s walk through how to create data preprocessing functions to build the features to:

• detect credit cards,
• calculate percentage censoring, and
• detect monetary amounts.

### 7.7.1 Detect credit cards

A credit card number is represented as four groups of four capital Xs in this dataset. Since the data is fairly well processed, we can be confident that spacing will not be an issue and all credit cards will be represented as “XXXX XXXX XXXX XXXX”. A first naive attempt may be to use str_detect() with “XXXX XXXX XXXX XXXX” to find all the credit cards.

It is a good idea to create a small example regular expression where you know the answer, and then prototype your function before moving to the main dataset.

We start by creating a vector with two positives, one negative, and one potential false positive. The last string is more tricky since it has the same shape as a credit card but has one too many groups.

credit_cards <- c(
"my XXXX XXXX XXXX XXXX balance, and XXXX XXXX XXXX XXXX.",
"card with number XXXX XXXX XXXX XXXX.",
"at XX/XX 2019 my first",
"live at XXXX XXXX XXXX XXXX XXXX SC"
)

str_detect(credit_cards, "XXXX XXXX XXXX XXXX")
## [1]  TRUE  TRUE FALSE  TRUE

As we feared, the last vector was falsely detected to be a credit card. Sometimes you will have to accept a certain number of false positives and/or false negatives, depending on the data and what you are trying to detect. In this case, we can make the regex a little more complicated to avoid that specific false positive. We need to make sure that the word coming before the X’s doesn’t end in a capital X and the word following the last X doesn’t start with a capital X. We place spaces around the credit card and use some negated character classes (Appendix 11.3) to detect anything BUT a capital X.

str_detect(credit_cards, "[^X] XXXX XXXX XXXX XXXX [^X]")
## [1]  TRUE FALSE FALSE FALSE

Hurray! This fixed the false positive, but it gave us a false negative in return. It turns out that this regex doesn’t allow the credit card to be followed by a period, since it requires a space there. We can fix this with an alternation that matches either a period or a space followed by a non-X.

str_detect(credit_cards, "[^X] +XXXX XXXX XXXX XXXX(\\.| [^X])")
## [1]  TRUE  TRUE FALSE FALSE

Now that we have a regular expression we are happy with we can wrap it up in a function we can use. We can extract the presence of a credit card with str_detect() and the number of credit cards with str_count().

creditcard_indicator <- function(x) {
str_detect(x, "[^X] +XXXX XXXX XXXX XXXX(\\.| [^X])")
}

creditcard_count <- function(x) {
str_count(x, "[^X] +XXXX XXXX XXXX XXXX(\\.| [^X])")
}

creditcard_indicator(credit_cards)
## [1]  TRUE  TRUE FALSE FALSE
creditcard_count(credit_cards)
## [1] 2 1 0 0

### 7.7.2 Calculate percentage censoring

Some of the complaints contain a high proportion of censoring, and we can build a feature to measure the percentage of the text that is censored.

There are often many ways to get to the same solution when working with regular expressions.

Let’s attack this problem by counting the number of X’s in each string, then counting the number of alphanumeric characters, and dividing the two to get a percentage.

str_count(credit_cards, "X")
## [1] 32 16  4 20
str_count(credit_cards, "[:alnum:]")
## [1] 44 30 17 28
str_count(credit_cards, "X") / str_count(credit_cards, "[:alnum:]")
## [1] 0.7272727 0.5333333 0.2352941 0.7142857

We can finish up by creating a function.

procent_censoring <- function(x) {
str_count(x, "X") / str_count(x, "[:alnum:]")
}

procent_censoring(credit_cards)
## [1] 0.7272727 0.5333333 0.2352941 0.7142857

### 7.7.3 Detect monetary amounts

We have already constructed a regular expression that detects the monetary amount from the text in Section 7.1, so now we can look at how to use this information. Let’s start by creating a little example and see what we can extract.

dollar_texts <- c(
"That will be {$20.00}", "{$3.00}, {$2.00} and {$7.00}",
"I have no money"
)

str_extract_all(dollar_texts, "\\{\\$[0-9\\.]*\\}")
## [[1]]
## [1] "{$20.00}"
##
## [[2]]
## [1] "{$3.00}" "{$2.00}" "{$7.00}"
##
## [[3]]
## character(0)

We can create a function that simply detects the dollar amount, and we can count the number of times each amount appears. Each occurrence also has a value, so it would be nice to include that information as well, such as the mean, minimum, or maximum. First, let’s extract the number from the strings. We could write a regular expression for this, but the parse_number() function from the readr package does a really good job of pulling out numbers.

str_extract_all(dollar_texts, "\\{\\$[0-9\\.]*\\}") %>%
map(readr::parse_number)
## [[1]]
## [1] 20
##
## [[2]]
## [1] 3 2 7
##
## [[3]]
## numeric(0)

Now that we have the numbers we can iterate over them with the function of our choice. Since we are going to have texts with no monetary amounts, we need to handle the case with zero numbers. Defaults for some functions with vectors of length zero can be undesirable; we don’t want -Inf to be a value. Let’s extract the maximum value and give cases with no monetary amounts a maximum of zero.
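We can check this default for ourselves:

```r
# Taking the max of a zero-length vector warns and returns -Inf,
# which we don't want as a feature value.
suppressWarnings(max(numeric(0)))
## [1] -Inf
```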

max_money <- function(x) {
str_extract_all(x, "\\{\\$[0-9\\.]*\\}") %>%
map(readr::parse_number) %>%
map_dbl(~ ifelse(length(.x) == 0, 0, max(.x)))
}

max_money(dollar_texts)
## [1] 20  7  0

Now that we have created some feature engineering functions, we can use them to (hopefully) make our classification model better.

## 7.8 Case study: feature hashing

The models we have created so far have used tokenization (Chapter 2) to split apart text data into tokens that are meaningful to us as human beings (words, bigrams) and then weighted these tokens by simple counts with word frequencies or weighted counts with tf-idf. A problem with these methods is that the output space is vast and dynamic. We could easily have more than 10,000 features in our training set, and we may run into computational problems with memory or long processing times. Deciding how many tokens to include can become a trade-off between computational time and information. This style of approach also doesn’t let us take advantage of new tokens we didn’t see in our training data.

One method that has gained popularity in the machine learning field is the hashing trick. This method addresses many of the challenges outlined above and is very fast with a low memory footprint.

Let’s start with the basics of feature hashing. First proposed by Weinberger et al. (2009), feature hashing was introduced as a dimensionality reduction method with a simple premise. We begin with a hashing function that we then apply to our tokens. A hashing function takes input of variable size and maps it to output of a fixed size. Hashing functions are commonly used in cryptography.

We will use the hashFunction package to illustrate the behavior of hashing functions. Suppose we have many country names in a character vector. We can apply the hashing function to each of the country names to project them into an integer space defined by the hashing function. We will use the 32-bit version of MurmurHash3 (Appleby 2008) here.
Hashing functions are typically very fast and have certain properties. For example, the output of a hash function is expected to be uniform, with the whole output space filled evenly. The “avalanche effect” describes how similar strings are hashed in such a way that their hashes are not similar in the output space.

library(hashFunction)
countries <- c(
"Palau", "Luxembourg", "Vietnam", "Guam", "Argentina", "Mayotte",
"Bouvet Island", "South Korea", "San Marino", "American Samoa"
)

map_int(countries, murmur3.32)

Since MurmurHash uses 32 bits, the number of possible values is 2^32 = 4294967296, which is admittedly not much of an improvement over ten country names. Let’s take the modulo of these big integer values to project them down to a more manageable space.

map_int(countries, murmur3.32) %% 24

Now we can use these values as indices when creating a matrix. This method is very fast; both the hashing and modulo can be performed independently for each input since neither needs information about the full corpus. Since we are reducing the space, there is a chance that multiple words are hashed to the same value. This is called a collision and, at first glance, it seems like it would be a big problem for a model. However, research finds that using feature hashing has roughly the same accuracy as a simple bag-of-words model, and the effect of collisions is quite minor (Forman and Kirshenbaum 2008).

Another step that is taken to avoid the negative effects of hash collisions is to use a second hashing function that returns 1 and -1. This determines whether we are adding or subtracting the index we get from the first hashing function. Suppose both the words “outdoor” and “pleasant” hash to the integer value 583. Without the second hashing function, both words would add to the same column, giving a combined count of 2. With signed hashing, there is a 50% chance that they will cancel each other out, which keeps any one feature from growing too much.

There are downsides to using feature hashing.
Feature hashing:

• still has one tuning parameter, and
• cannot be reversed.

The number of buckets you have correlates with computation speed and collision rate, which in turn affects performance. It is your job to find the output that best suits your needs. Increasing the number of buckets will decrease the collision rate but will, in turn, return a larger output dataset, which increases model fitting time. The number of buckets is tunable in tidymodels using the tune package.

Perhaps the more important downside to using feature hashing is that the operation can’t be reversed. We are not able to detect if a collision occurs, and it is difficult to understand the effect of any word in the model. Remember that we are left with n columns of hashes (not tokens), so if we find that the 274th column is a highly predictive feature, we cannot know in general which tokens contribute to that column. We cannot directly connect model values to words or tokens at all. We could go back to our training set and create a paired list of the tokens and the hashes they map to. Sometimes we might find only one token in that list, but it may have two (or three or four or more!) different tokens contributing. This feature hashing method is used because of its speed and scalability, not because it is interpretable.

Feature hashing on tokens is available in tidymodels using the step_texthash() step from textrecipes.

complaints_hash <- recipe(product ~ consumer_complaint_narrative,
data = complaints_train
) %>%
step_tokenize(consumer_complaint_narrative) %>%
step_texthash(consumer_complaint_narrative, signed = TRUE, num_terms = 512) %>%
prep() %>%
bake(new_data = NULL)

dim(complaints_hash)
## [1] 87911   513

There are many columns in the results. Let’s take a glimpse() at the first ten columns.
complaints_hash %>%
select(consumer_complaint_narrative_hash001:consumer_complaint_narrative_hash010) %>%
glimpse()
## Rows: 87,911
## Columns: 10
## $ consumer_complaint_narrative_hash001 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ consumer_complaint_narrative_hash002 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ consumer_complaint_narrative_hash003 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -1…
## $ consumer_complaint_narrative_hash004 <dbl> -1, 0, 0, 0, -3, 0, 0, 0, 0, 0, …
## $ consumer_complaint_narrative_hash005 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1…
## $ consumer_complaint_narrative_hash006 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ consumer_complaint_narrative_hash007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -6…
## $ consumer_complaint_narrative_hash008 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,…
## $ consumer_complaint_narrative_hash009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ consumer_complaint_narrative_hash010 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

By using step_texthash() we can quickly generate machine-ready data with a consistent number of variables. This typically results in a slight loss of performance compared to using a traditional bag-of-words representation. An example of this loss is illustrated in this textrecipes blogpost.
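To build intuition for how the two hashing functions interact, here is a toy sketch in base R. The “hash” below (a sum of character codes) stands in for a real hashing function such as MurmurHash3, so the bucket assignments are illustrative only.

```r
# Toy sketch of signed feature hashing. The "hash" (sum of character codes)
# is NOT a real hashing function; it only illustrates the mechanics of
# bucketing with the first hash and signing with the second.
hashed_features <- function(tokens, n_buckets = 8) {
  vec <- numeric(n_buckets)
  for (tok in tokens) {
    h <- sum(utf8ToInt(tok))           # toy stand-in for a token hash
    idx <- (h %% n_buckets) + 1        # first hash: which bucket (1-based)
    sgn <- if (h %% 2 == 0) 1 else -1  # second hash: add or subtract
    vec[idx] <- vec[idx] + sgn
  }
  vec
}

hashed_features(c("we", "had", "a", "problem", "on", "06", "25", "2018"))
## [1]  0 -2  0 -1  1 -2  1 -1
```

Notice how “a” and “problem” land in the same bucket (a collision) and, because both get a negative sign here, their contributions stack rather than cancel; with a real sign hash there would be a 50% chance they cancel instead.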

### 7.8.1 Text normalization

When working with text, you will inevitably run into problems with encodings and related irregularities. These kinds of problems have a significant influence on feature hashing. Consider the German word “schön”. The o with an umlaut (two dots over it) is a fairly simple character, but it can be represented in a couple of different ways. We can either use a single character, \U00f6, to represent the letter with umlaut, or we can use two characters: one for the o and one to denote the presence of two dots over the previous character, \U0308.

s1 <- "sch\U00f6n"
s2 <- "scho\U0308n"

These two strings will print the same for us as human readers.

s1
## [1] "schön"
s2
## [1] "schön"

However, they are not equal.

s1 == s2
## [1] FALSE

This poses a problem for feature hashing, which relies on the avalanche effect to perform correctly. The avalanche effect will result in these two words (which should be identical) hashing to completely different values.

murmur3.32(s1)
murmur3.32(s2)

We can deal with this problem by performing text normalization on our text before feeding it into our preprocessing engine. One package for text normalization is stringi, which includes many different normalization methods. How these methods work is beyond the scope of this book, but know that the normalization functions make text like our two versions of “schön” equivalent. We will use stri_trans_nfc() for this example, which performs Canonical Decomposition, followed by Canonical Composition.

library(stringi)

stri_trans_nfc(s1) == stri_trans_nfc(s2)
## [1] TRUE

murmur3.32(stri_trans_nfc(s1))
murmur3.32(stri_trans_nfc(s2))

Now we see that the strings are equal after normalization.

This issue of text normalization can be important even if you don’t use feature hashing in your machine learning.

Since these words are encoded in different ways, they will be counted separately when we are counting token frequencies. Representing what should be a single token in multiple ways will split the counts. This will introduce noise in the best case, and in worse cases, some tokens will fall below the cutoff when we select tokens, leading to a loss of potentially informative words.
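A quick sketch shows how the two encodings of “schön” split a count until they are normalized:

```r
library(stringi)

words <- c("sch\U00f6n", "scho\U0308n")

# Before normalization, the two encodings count as two distinct types...
length(unique(words))
## [1] 2

# ...after NFC normalization, they collapse into a single type.
length(unique(stri_trans_nfc(words)))
## [1] 1
```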

Luckily this is easily addressed by using stri_trans_nfc() on our text columns before starting preprocessing.

## 7.9 What evaluation metrics are appropriate?

We have focused on using accuracy and ROC AUC as metrics for our classification models so far, along with sensitivity and specificity in Section 7.3. These are not the only classification metrics available and your choice will often depend on how much you care about false positives compared to false negatives.

If you know before you fit your model that you want to compute one or more metrics, you can specify them in a call to metric_set(). Let’s set up a tuning grid for two new classification metrics, recall and precision.

nb_rs <- fit_resamples(
nb_wf,
complaints_folds,
metrics = metric_set(recall, precision)
)

If you have already fit your model, you can still compute and explore non-default metrics as long as you saved the predictions for your resampled datasets using control_resamples(save_pred = TRUE).

Let’s go back to the naive Bayes model we tuned in Section 7.1.1, with predictions stored in nb_rs_predictions. We can compute the overall recall.

nb_rs_predictions %>%
recall(product, .pred_class)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.935

We can also compute the recall for each resample using group_by().

nb_rs_predictions %>%
group_by(id) %>%
recall(product, .pred_class)
## # A tibble: 10 x 4
##    id     .metric .estimator .estimate
##    <chr>  <chr>   <chr>          <dbl>
##  1 Fold01 recall  binary         0.937
##  2 Fold02 recall  binary         0.939
##  3 Fold03 recall  binary         0.931
##  4 Fold04 recall  binary         0.939
##  5 Fold05 recall  binary         0.937
##  6 Fold06 recall  binary         0.934
##  7 Fold07 recall  binary         0.931
##  8 Fold08 recall  binary         0.930
##  9 Fold09 recall  binary         0.948
## 10 Fold10 recall  binary         0.926

Many of the metrics used for classification are functions of the true positive, true negative, false positive, and false negative rates. The confusion matrix, the contingency table of observed classes and predicted classes, gives us information on these rates directly.

nb_rs_predictions %>%
filter(id == "Fold01") %>%
conf_mat(product, .pred_class)
##           Truth
## Prediction Credit Other
##     Credit   3913  2210
##     Other     263  2406

It is possible with many datasets to achieve high accuracy just by predicting the majority class all the time, but such a model is not useful in the real world. Accuracy alone is often not a good way to assess the performance of classification models.
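For instance, if 70% of complaints belonged to one class, a “model” that always predicts that class would score 70% accuracy while learning nothing; the class balance below is made up for illustration.

```r
# A degenerate "model" that always predicts the majority class still gets
# the majority proportion as its accuracy (class balance here is made up).
truth <- c(rep("Credit", 70), rep("Other", 30))
pred  <- rep("Credit", 100)

mean(truth == pred)
## [1] 0.7
```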

For the full set of classification metric options, see the yardstick documentation.

## 7.10 The full game: classification

We have come a long way from our first classification model in Section 7.1.1 and it is time to see how we can use what we have learned to improve it. We started this chapter with a simple naive Bayes model and n-gram token counts. Since then we have looked at different models, preprocessing techniques, and domain-specific feature engineering. For our final model, let’s use some of the domain-specific features we developed in Section 7.7 along with n-grams, as well as the non-text features. Our model doesn’t have any significant hyperparameters to tune, but we will tune the number of tokens to include. For this final model we will:

• train on the same set of cross-validation resamples used throughout this chapter,
• include both text as well as tags and date features,
• tune the number of tokens used in the model,
• include trigrams, bigrams, and unigrams,
• include custom-engineered features,
• remove stop words using the Snowball lexicon, and
• finally evaluate on the testing set, which we have not touched at all yet.

### 7.10.1 Feature selection

We start by creating a new preprocessing recipe. Let’s use the same predictors and handle date_received and tags in the same way.

complaints_rec_v2 <-
recipe(product ~ date_received + tags + consumer_complaint_narrative,
data = complaints_train
) %>%
step_date(date_received, features = c("month", "dow"), role = "dates") %>%
step_dummy(has_role("dates")) %>%
step_unknown(tags) %>%
step_dummy(tags)

After exploring this text data more in Section 7.7, we want to add these custom features to our final model. To do this, we use step_textfeature() to compute custom text features. We create a list of the custom text features and pass this list to step_textfeature() via the extract_functions argument. Note how we have to take a copy of consumer_complaint_narrative using step_mutate() as step_textfeature() consumes the column.

extract_funs <- list(
creditcard_count = creditcard_count,
procent_censoring = procent_censoring,
max_money = max_money
)

complaints_rec_v2 <- complaints_rec_v2 %>%
step_mutate(narrative_copy = consumer_complaint_narrative) %>%
step_textfeature(narrative_copy, extract_functions = extract_funs)

The tokenization and stop word removal will be similar to the other models in this chapter, but this time we’ll include trigrams, bigrams, and unigrams in the model. In our original model, we only included 500 tokens; for our final model, let’s treat the number of tokens as a hyperparameter that we vary when we tune the final model. Let’s also set the min_times argument to 250, to throw away tokens that appear fewer than 250 times in the entire corpus. We want our model to be robust, and a token needs to appear enough times before we include it.

Even the 2000 most common tokens in this dataset appear far more than 250 times, but it can still be good practice to specify min_times to be safe. Your choice for min_times should depend on your data and how robust you need your model to be.

complaints_rec_v2 <- complaints_rec_v2 %>%
step_tokenize(consumer_complaint_narrative) %>%
step_stopwords(consumer_complaint_narrative) %>%
step_ngram(consumer_complaint_narrative, num_tokens = 3, min_num_tokens = 1) %>%
step_tokenfilter(consumer_complaint_narrative,
max_tokens = tune(), min_times = 250
) %>%
step_tfidf(consumer_complaint_narrative)

### 7.10.2 Specify the model

We use the support vector machine model, since it performed well in Section 7.3. We can reuse parts of the old workflow and update the recipe specification.

svm_wf_v2 <- svm_wf %>%
  update_recipe(complaints_rec_v2)

svm_wf_v2
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 12 Recipe Steps
##
## ● step_date()
## ● step_rm()
## ● step_dummy()
## ● step_unknown()
## ● step_dummy()
## ● step_mutate()
## ● step_textfeature()
## ● step_tokenize()
## ● step_stopwords()
## ● step_ngram()
## ● ...
## ● and 2 more steps.
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Computational engine: liquidSVM

Let’s create a grid of possible hyperparameter values using grid_regular() from the dials package. With levels = 5, we have five possible values to try for the maximum number of tokens to include in the model.

param_grid <- grid_regular(
  max_tokens(range = c(500, 2000)),
  levels = 5
)

param_grid
## # A tibble: 5 x 1
##   max_tokens
##        <int>
## 1        500
## 2        875
## 3       1250
## 4       1625
## 5       2000
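
The five levels are spaced evenly across the requested range, which we can verify by hand:

```r
seq(500, 2000, length.out = 5)
## [1]  500  875 1250 1625 2000
```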

Now it’s time to set up our tuning grid. Let’s save the predictions so we can explore them in more detail, and let’s also set the same custom metrics for this SVM model.

set.seed(2020)
tune_rs <- tune_grid(
  svm_wf_v2,
  complaints_folds,
  grid = param_grid,
  metrics = metric_set(accuracy, sensitivity, specificity),
  control = control_resamples(save_pred = TRUE)
)

We have fitted these classification models!

### 7.10.3 Evaluate the modeling

Now that the tuning is finished, we can take a look at the best performing results.

show_best(tune_rs, metric = "accuracy")
## # A tibble: 5 x 7
##   max_tokens .metric  .estimator  mean     n std_err .config
##        <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1       1625 accuracy binary     0.863    10 0.00198 Recipe4
## 2       1250 accuracy binary     0.862    10 0.00155 Recipe3
## 3       2000 accuracy binary     0.859    10 0.00435 Recipe5
## 4        875 accuracy binary     0.859    10 0.00171 Recipe2
## 5        500 accuracy binary     0.856    10 0.00167 Recipe1

We see that 1625 tokens is a good middle ground before we start to overfit. We can extract the best hyperparameter and use it to finalize the workflow.

best_accuracy <- select_best(tune_rs, metric = "accuracy")

svm_wf_final <- finalize_workflow(
  svm_wf_v2,
  best_accuracy
)

svm_wf_final
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 12 Recipe Steps
##
## ● step_date()
## ● step_rm()
## ● step_dummy()
## ● step_unknown()
## ● step_dummy()
## ● step_mutate()
## ● step_textfeature()
## ● step_tokenize()
## ● step_stopwords()
## ● step_ngram()
## ● ...
## ● and 2 more steps.
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Computational engine: liquidSVM

The svm_wf_final workflow now has a finalized value for max_tokens.

The test set is a precious resource that can only be used to estimate performance of your final model on new data. We did not use the test set to compare or tune models, but we use it now that our model is finalized.

We can use the function last_fit() to fit our model one last time on the training data and evaluate it on the testing data.

final_res <- svm_wf_final %>%
  last_fit(complaints_split, metrics = metric_set(accuracy))

Let’s explore our results using collect_metrics() and collect_predictions().

final_res_metrics <- collect_metrics(final_res)
final_res_predictions <- collect_predictions(final_res)

How does the final model perform on the testing data?

final_res_metrics
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.868

This is pretty good, and we note that this result is close to what we saw on the resampled training data, an indication that our model is not overfit and generalizes well enough.

The confusion matrix on the testing data in Figure 7.9 also yields pleasing results. It appears symmetric with a strong presence on the diagonal, showing that there isn’t any strong bias towards either of the classes.

final_res_predictions %>%
  conf_mat(truth = product, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
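
Beyond the confusion matrix, we can compute the same custom metrics we tuned with on the test set predictions; a quick sketch using yardstick (loaded as part of tidymodels):

```r
# Per-class performance on the test set predictions
test_metrics <- metric_set(accuracy, sensitivity, specificity)

final_res_predictions %>%
  test_metrics(truth = product, estimate = .pred_class)
```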

## 7.11 Summary

You can use classification modeling to predict labels or categorical variables from a dataset, including datasets that contain text. Naive Bayes models can perform well with text data, since each feature is handled independently and thus large numbers of features are computationally feasible. This is important, as bag-of-words text models can involve thousands of tokens. We also saw that support vector machine models perform well for text data. Your own domain knowledge about your text data is incredibly valuable, and using that knowledge in careful engineering of custom features can improve your model.

### 7.11.1 In this chapter, you learned:

• how text data can be used in a classification model
• how to tune hyperparameters in the data preprocessing stage
• how to compare different model types
• that models can combine both text and non-text predictors
• how feature hashing can be used as a fast alternative to bag-of-words
• about engineering custom features for machine learning
• about performance metrics for classification models


1. The censored trigrams that include “oh” seem mysterious but upon closer examination, they come from censored addresses, with “oh” representing the US state of Ohio. Most two-letter state abbreviations are censored but this one is not, since it is ambiguous. This highlights the real challenge of anonymizing text.↩︎