Foreword

In Chapters 6 and 7, we use algorithms such as regularized linear models, support vector machines, and naive Bayes models to predict outcomes from predictors including text data. Deep learning models approach the same tasks and have the same goals, but the algorithms involved are different. Deep learning models are “deep” in the sense that they use multiple layers to learn how to map from input features to output outcomes; this is in contrast to the kinds of models we used in the previous two chapters which use a shallow (single) mapping.

Deep learning models can be effective for text prediction problems because they use these multiple layers to capture complex relationships in language.

The layers in a deep learning model are connected in a network and these models are called neural networks, although they do not work much like a human brain. The layers can be connected in different configurations called network architectures, which sometimes incorporate word embeddings, as described in Chapter 5. We will cover three network architectures in the following chapters:

  • Chapter 8 starts our exploration of deep learning for text with a densely connected neural network. Think of this more straightforward architecture as a bridge between the “shallow” learning approaches of Chapters 6 and 7 that treated text as a bag of words and the more complex architectures to come.
  • Chapter 9 continues by walking through how to train and evaluate a more advanced architecture, a long short-term memory (LSTM) network. LSTMs are among the most common architectures used for text data because they model text as a long sequence of words or characters.
  • Chapter 10 wraps up our treatment of deep learning for text with the convolutional neural network (CNN) architecture. CNNs are another advanced architecture appropriate for text data because they can capture specific local patterns.

Our discussion of network architectures is fairly specific to text data; in other situations you may do best using a different architecture, for example, when working with dense, tabular data.

For the following chapters, we will use tidymodels packages along with Tensorflow and the R interface to Keras (Allaire and Chollet 2020) for preprocessing, modeling, and evaluation.

The keras R package provides an interface for R users to Keras, a high-level API for building neural networks.

Table 7.1 presents some key differences between deep learning and what, in this book, we call machine learning methods.

TABLE 7.1: Comparing deep learning with other machine learning methods
Machine learning Deep learning
Faster to train Takes more time to train
Software is typically easier to install Software can be more challenging to install
Can achieve good performance with less data Requires more data for good performance
Depends on preprocessing to model more than very simple relationships Can model highly complex relationships

Deep learning and more traditional machine learning algorithms are different, but the structure of the modeling process is largely the same, no matter what the specific details of prediction or algorithm are.

Spending your data budget

A limited amount of data is available for any given modeling project, and this data must be allocated to different tasks to balance competing priorities. We espouse an approach of first splitting data in testing and training sets, holding the testing set back until all modeling tasks are completed, including feature engineering and tuning. This testing set is then used as a final check on model performance, to estimate how the final model will perform on new data.

The training data is available for tasks from model parameter estimation to determining which features are important and more. To compare or tune model options or parameters, this training set can be further split so that models can be evaluated on a validation set, or it can be resampled as described in Section 6.1.2 to create new simulated data sets for the purpose of evaluation.

Feature engineering

Text data requires extensive processing to be appropriate for modeling, whether via an algorithm like regularized regression or a neural network. Chapters 1 through 5 covered several of the most important techniques that are used to transform language into a representation appropriate for computation. This feature engineering part of the modeling process can be intensive for text, sometimes more computationally expensive than the fitting a model algorithm.

We espouse an approach of implementing feature engineering on training data only, typically using resampled data sets, to avoid obtaining an overly optimistic estimate of model performance. Feature engineering can sometimes be a part of the machine learning process where subtle data leakage occurs, when practitioners use information (say, to preprocess data or engineer features) that will not be available at prediction time. One example of this is tf-idf, which we introduced in Chapter 5 and used in both Chapters 6 and 7. As a reminder, the term frequency of a word is how frequently a word occurs in a document, and the inverse document frequency of a word is a weight, typically defined as:

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

These two quantities are multiplied together to compute a term’s tf-idf, a statistic that measures the frequency of a term adjusted for how rarely it is used. Computing inverse document frequency involves the whole corpus or collection of documents. When you are fitting a model, how should that corpus be defined? We strongly advise that you should use your training data only in such a situation. Using all the data available to you (training plus testing) involves leakage of information from the testing set to the model, and any estimates you may make of how your model may perform on new data may be overly optimistic.

This specific example focused on tf-idf, but data leakage is a serious challenge in general for building machine learning models and systems. The tidymodels framework is designed to encourage good statistical practice, such as learning feature engineering transformations from training data and then applying those transformation to othe data sets.

Fitting and tuning

Many different kinds of models are appropriate for text data, from more straightforward models like the linear models explored deeply in Chapter 6 to the neural network models we cover in Chapters 10 and 9. Some of these models have hyperparameters which cannot be learned from data during fitting, like the regularization parameter of the models in Chapter 6; these hyperparameters can be tuned using resampled data sets.

Model evaluation

Once models are trained and perhaps tuned, we can evaluate their performance quantitatively using metrics appropriate for the kind of practical problem being dealt with. Model explanation analysis, such as feature importance, also helps us understand how well and why models are behaving the way they do.

Putting the model process in context

This outline of the model process depends on us as data practitioners coming prepared for modeling with a healthy understanding of our data sets from exploratory data analysis. Silge and Robinson (2017) provide a guide for exploratory data analysis for text.

Also, in practice, the structure of a real modeling project is iterative. After fitting and tuning a first model or set of a models, a practitioner will often return to build more or better features, then refit models, and evaluate in a more detailed way. Notice that we take this approach in each chapter, both for more straightforward machine learning and deep learning; we start with a simpler model and then go back again and again to improve it in several ways. This iterative approach is healthy and appropriate, as long as good practices in data “spending” are observed. The testing set cannot be used during this iterative back-and-forth, and using resampled data sets can set us as practitioners up for more accurate estimates of performance.